The Summit drew over 1000 attendees this year, with scores of presentations and hundreds of AI leaders from large companies as well as many startups.
Every September since 2019, the AI HW Summit in the Bay Area has been the focal point for new technologies around AI. While the event, hosted by UK-based Kisaco, started with semiconductors, it has continually expanded its focus to include software, models, networking, and full data center optimization. Next year it will return as the AI Infra Summit, acknowledging that AI has become a full-stack endeavor that consumes entire data centers.
It may surprise some that Nvidia did not present at the event. They don’t see the need, since everyone knows who they are and how fast their GPUs are.
Here are a few insights from the event.
A Food Fight Erupted Over the Claim, “Fastest Inference on the Planet”
The battle over inference services is really heating up, with Cerebras, Groq, and SambaNova all claiming to offer the fastest tokens-as-a-service available. Now, I am fairly confident that nobody is lying here, but let’s just say each company is cherry-picking the size of the Llama 3.1 model they want to tout. And they are mostly referencing tests run by Artificial Analysis, which publishes its results on its website.
Here are Cerebras’ benchmark results:
And SambaNova claims the fastest 405B-parameter Llama 3.1. There was a lot of discussion about why Groq and Cerebras have not (yet) run this model. It could be that they don’t have enough SRAM on their systems to do as well. Or they just haven’t had enough time. (Note: OctoAI is reportedly in acquisition discussions with Nvidia.)
And here are Groq’s results. Groq seems to be picking up steam, and landed a $640M investment from BlackRock and others. The company’s development cloud has quickly grown to over 360,000 developers building AI on GroqCloud. Groq also landed a large deal with Aramco to build a giant data center in Saudi Arabia that could grow to some 200,000 Language Processing Units.
So, I went to the AA website (no, not THAT AA, though perhaps I should) and found a very interesting chart that proclaims Cerebras the winner on the 70B model in both performance and price per million tokens, at just under 50 cents.
Confused? So am I. But here’s the deal. Artificial Analysis runs a wide variety of models on whatever hardware a model service provider uses. They don’t do any tuning, and Nvidia is only represented by the providers who use an unspecified Nvidia GPU. They don’t disclose how many accelerators were used in the runs, nor which lower-level software was used.
Inference will become a larger market than training AI, and all three of these companies have demonstrated a massive leap forward in lowering the costs of using AI in real-world applications. Nice job! Now, if you could just publish some MLPerf results, we’d all feel better. AA provides a great service, but does not replace benchmarks run by the hardware providers themselves such as MLPerf, whose benchmarks are all peer-reviewed prior to release.
Optical is the Next Big Thing
How many times have we heard that? It’s always coming “soon”. Yes, optical interconnects are widely used for rack-to-rack connectivity in modern data centers to get around the length limitations of copper and the need for retimers. But optical is rarely used within a rack, where the cable lengths are not a problem for the cheaper copper solutions.
But that may be about to change. Celestial AI is developing an elegant and performant design they were touting at the conference. Their approach could help solve the “memory wall” GPUs contend with today, by providing access to over 33 TB of shared HBM memory space. They claim they can lower costs by over 25 times, power by 8 times, and RDMA latency by 5 times, all while providing over 4 times better bandwidth. We will be watching these guys closely as they finish engineering their first-generation product.
What Ever Happened to Analog Computing?
There is a lot of research going on at IBM, Intel, and elsewhere to develop a performant analog in-memory compute solution. It looks great in PowerPoint, but the D-to-A converters add latency, and the size of memory is not conducive to running the LLMs that are driving billions of dollars of investment these days.
Enter Mentium, a startup out of UC Santa Barbara that is building a platform combining a digital processor with an in-memory-compute analog processor, which they believe provides the best of both worlds.
As an important aside, Mentium switched from in-house EDA tool hosting to the Synopsys Cloud, hosted on Microsoft Azure. The switch saved the company months of development time and costs, while reducing the complexity they were facing using on-prem EDA tools.
The Mega-NIC From Enfabrica Is Coming Soon
One of Nvidia’s greatest assets is NVLink, which interconnects up to 512 GPUs at 100 GB/s per link and is 14 times faster than PCIe. But what about the “rest of the story”: how do you connect the GPU nodes? It takes a lot of switches.
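For readers wondering where a figure like “14 times faster than PCIe” can come from, here is a rough back-of-the-envelope sketch. Only the 100 GB/s per-link number comes from the text above; the link count per GPU (18) and the PCIe Gen5 x16 bandwidth (about 128 GB/s bidirectional) are our assumptions, so treat the result as an illustration rather than a vendor specification.

```python
# Back-of-the-envelope NVLink vs. PCIe comparison.
# Only the per-link figure is taken from the article; the other two
# constants are assumptions made for illustration.
NVLINK_GBPS_PER_LINK = 100   # GB/s per link (from the article)
NVLINK_LINKS_PER_GPU = 18    # ASSUMPTION: number of NVLink links per GPU
PCIE_GEN5_X16_GBPS = 128     # ASSUMPTION: PCIe Gen5 x16, bidirectional GB/s


def nvlink_total_gbps(per_link=NVLINK_GBPS_PER_LINK,
                      links=NVLINK_LINKS_PER_GPU):
    """Aggregate NVLink bandwidth per GPU in GB/s."""
    return per_link * links


def speedup_vs_pcie(pcie_gbps=PCIE_GEN5_X16_GBPS):
    """Ratio of aggregate NVLink bandwidth to one PCIe x16 slot."""
    return nvlink_total_gbps() / pcie_gbps


print(f"NVLink per GPU: {nvlink_total_gbps()} GB/s")
print(f"Speedup vs. PCIe: {speedup_vs_pcie():.1f}x")
```

Under these assumptions the ratio lands at roughly 14, which matches the claim. Change either assumed constant and the ratio moves accordingly.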
Enfabrica came out of stealth at last year’s AI HW Summit, with backing from Nvidia and a who’s who of venture capitalists. This year, the company is closer to productization, and has expanded its value proposition to include the failover features that are so important to AI training.
When adoption begins in 2025, we expect Enfabrica to become a darling of the industry and to see significant uptake.
Other Stories Worth Telling
Microsoft, AWS, and Meta all shared insights at the data center level, more than we can fit into this blog. But their presentations and others reinforced the message that AI is now data-center-scale, with tens of thousands of GPUs. Meta forecasted a ten-fold increase in cluster size by 2030; think millions of GPUs. And while AMD and Intel re-told their stories and roadmaps (nothing new to see here; move along), there were a lot of great stories told by entrepreneurs at the event. Here are a few:
Positron:
Furiosa AI:
The South Korean startup Furiosa AI discussed its approach to efficient AI using tensor contraction, rather than matmul, as the primitive operation. In data centers where typical power density is currently on the order of 15 kW per rack, this could be interesting, although many other companies, such as Hailo and d-Matrix, are on a similar power-efficiency track using SRAM for weights.
Broadcom and the Ultra Ethernet Consortium
Now that Nvidia has joined the party, there is little doubt that the UEC will become a widespread networking standard when it ships sometime in 2026 (?). While we are confident UEC is the next standard, that does not mean that Nvidia will stop innovating in its own networking, including NVLink and InfiniBand, as well as its own Spectrum Ethernet.
Conclusions
Whew! That was a lot of slides and four full days of companies striving for AI efficiency. And the focus has expanded far beyond the chips that power AI. For example, Meta showed failure data and a three-pronged strategy to deal with the certainty of failures: avoid failures, detect failures, and tolerate the inevitable ones.
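To see why failures become a “certainty” at this scale, here is a minimal sketch of the arithmetic. It assumes each GPU fails independently with the same daily probability; the function name and the 1-in-10,000 daily failure rate are hypothetical placeholders of ours, not Meta’s data.

```python
# Probability that at least one GPU in a cluster fails within a time window.
# ASSUMPTIONS: failures are independent and identically distributed, and the
# per-GPU daily failure probability below is a made-up illustrative value.
HYPOTHETICAL_DAILY_FAILURE_PROB = 1e-4


def prob_any_failure(num_gpus,
                     per_gpu_daily_failure_prob=HYPOTHETICAL_DAILY_FAILURE_PROB,
                     days=1):
    """Return P(at least one of num_gpus fails within `days` days)."""
    # Chance that a single GPU survives the whole window.
    survive_one = (1.0 - per_gpu_daily_failure_prob) ** days
    # Chance that every GPU survives; the complement is "any failure".
    return 1.0 - survive_one ** num_gpus


for cluster_size in (1_000, 10_000, 100_000, 1_000_000):
    p = prob_any_failure(cluster_size)
    print(f"{cluster_size:>9,} GPUs: P(any failure in a day) = {p:.3f}")
```

Even with a tiny per-device failure rate, the probability of seeing at least one failure per day climbs toward 1 as the cluster grows, which is why detecting and tolerating failures matters as much as avoiding them.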
If you only have time to attend two conferences next year, Nvidia GTC and the AI Infra Summit (new name) are the two you should attend.
Hope to see you there!