By acquiring ZT Systems, a favored hyperscale systems designer, AMD addresses a fundamental weakness it faces versus Nvidia: AI requires complete system design, not just a fast chip. What else must it do to grab a 20% share?
At a recent investment conference hosted by KeyBanc, I presented my thesis on whether AMD can catch up with Nvidia in AI. Looking only at specs, AMD has already caught up with Nvidia on the chip front. However, those performance claims are not AI performance; they are raw math benchmarks, not AI benchmarks. And they are measured at the chip level, not at the rack level Nvidia touts with the Blackwell-based NVL72. AMD CTO Mark Papermaster committed last year that AMD would publish MLPerf results, so I expect this issue will be addressed soon. So, let's talk about the other three areas AMD needs to invest and innovate in to become a serious AI contender (>10% share).
The Three Gaps
I noted three areas AMD needs to invest in to close the gap with the AI leader: Systems, Software, and Networking. On the systems side, buying New Jersey-based ZT Systems will help, primarily with cooling solutions. ZT's customers include some of the largest hyperscalers, and ZT knows what it takes to compete at the rack level. Selling off the hardware manufacturing business while retaining the critical engineers is an intelligent way to build up AMD's systems mojo. However, the transition could easily take a couple of years. While ZT helps AMD on the Systems side, Networking and Software remain works in progress:
Networking
The bigger systems gap AMD must close, however, is networking, which ZT does not address. Nvidia has three networking technologies: NVLink, Ethernet, and InfiniBand. NVLink's latency is roughly 500 to 1,000 times lower than that of the Ethernet AMD uses for multi-system solutions, and its bandwidth is roughly 18 times higher. When training very large language models, that's a show-stopper for AMD (and Intel, by the way).
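To see why the bandwidth gap alone is decisive, here is a minimal back-of-envelope sketch in Python. All numbers are illustrative assumptions chosen to reflect the roughly 18x ratio cited above, not measured figures from either vendor:

```python
# Back-of-envelope: ideal wire time to exchange a full set of gradients
# between GPUs over two fabrics. Figures below are illustrative assumptions.

def transfer_time_s(payload_gb: float, bandwidth_gb_s: float) -> float:
    """Ideal transfer time (seconds) for a payload at a given per-GPU bandwidth,
    ignoring latency, protocol overhead, and topology effects."""
    return payload_gb / bandwidth_gb_s

gradients_gb = 140.0    # assumed: ~70B parameters in FP16 (~2 bytes each)
nvlink_gb_s = 900.0     # assumed NVLink-class per-GPU bandwidth
ethernet_gb_s = 50.0    # assumed 400GbE-class per-GPU bandwidth (~18x lower)

t_nvlink = transfer_time_s(gradients_gb, nvlink_gb_s)
t_ethernet = transfer_time_s(gradients_gb, ethernet_gb_s)

print(f"NVLink-class fabric:   {t_nvlink:.2f} s per gradient exchange")
print(f"Ethernet-class fabric: {t_ethernet:.2f} s per gradient exchange")
print(f"Slowdown over Ethernet: {t_ethernet / t_nvlink:.0f}x")
```

Since a training run performs such an exchange every step, an 18x slower fabric leaves expensive GPUs idle waiting on the network, which is why the interconnect, not the chip, becomes the limiter at scale.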
So, the most critical missing element is a GPU-to-GPU interconnect like NVLink that can scale beyond an eight-GPU server. Nvidia's NVL72 Blackwell system directly connects all 72 GPUs, delivering the minimal latency needed for very large language model training and inference. AMD, Intel, and other major tech companies are collaborating on an open alternative to NVLink called Ultra Accelerator Link (UALink).
The UALink specification will support up to 1,024 accelerators in a cluster, more than NVLink/NVSwitch 5.0, but it's going to take a couple of years to get there. The spec is nearly finished and will be updated later this year, and it will then take a couple more years to produce silicon. So call it a 2027 thing. NVLink 6.0 will be shipping by then, probably at twice the speed of 5.0. But at least the rest of the crowd (AMD, Intel, etc.) will have something that more closely aligns with customer demand to interconnect thousands of accelerators.
Software
AMD realizes it is far behind Nvidia on the software front and recently acquired Silo.ai to help shore up this vital area. Silo is a European software company that provides AI software for financial services, aviation, healthcare, manufacturing, consumer goods, and telecommunications. Silo is (or was) Europe's largest privately held AI software company and will certainly help AMD begin to close this gap. However, AMD's biggest software problem remains CUDA, the low-level libraries that make Nvidia GPUs sing, along with performance-enhancing software like the Triton Inference Server and TensorRT-LLM, which can more than quadruple the performance of Nvidia GPUs. We suspect addressing this gap is a high priority for AMD.
Conclusions
AMD is on an arduous path to address the competitive gaps for large-scale AI in the three areas we have identified. Yes, Nvidia has far more software beyond what we have noted here (e.g., Omniverse and the slew of technologies for Physical AI), but the rest of the story will unfold over time as AMD builds an ecosystem and fixes its networking deficiencies. (We note that the ZLUDA project, meant to provide a CUDA remedy, has been cancelled.) It is going to take time, and perhaps another acquisition in the networking space, unless AMD wants to cede that revenue to Broadcom.
Disclosures: This article expresses the opinions of the author and is not to be taken as advice to purchase from or invest in the companies mentioned. My firm, Cambrian-AI Research, is fortunate to have many semiconductor firms as our clients, including BrainChip, Cadence, Cerebras Systems, Esperanto, IBM, Intel, NVIDIA, Qualcomm, Graphcore, Groq, SiMa.ai, Synopsys, Tenstorrent, Ventana Micro Systems, and scores of investors. We have no investment positions in any of the companies mentioned in this article. For more information, please visit our website at https://cambrian-AI.com.