The story about DeepSeek has disrupted the prevailing AI narrative, impacted the markets and spurred a media storm: A large language model from China competes with the leading LLMs from the U.S. – and it does so without requiring nearly as costly a computational investment. Maybe the U.S. doesn’t have the technological lead we thought. Maybe heaps of GPUs aren’t necessary for AI’s special sauce.
But the heightened drama of this story rests on a false premise: LLMs are the Holy Grail. Here’s why the stakes aren’t nearly as high as they’re made out to be and why the AI investment frenzy has been misguided.
Amazement At Large Language Models
Don’t get me wrong – LLMs represent unprecedented progress. I’ve been in machine learning since 1992 – the first six of those years working in natural language processing research – and I never thought I’d see anything like LLMs during my lifetime. I am and will always remain slack-jawed and gobsmacked.
LLMs’ uncanny fluency with human language confirms the ambitious hope that has fueled much machine learning research: Given enough examples from which to learn, computers can develop capabilities so advanced that they defy human comprehension.
Just as the brain’s functioning is beyond its own grasp, so are LLMs. We know how to program computers to perform an exhaustive, automatic learning process, but we can hardly unpack the result, the thing that’s been learned (built) by the process: a massive neural network. It can only be observed, not dissected. We can assess it empirically by checking its behavior, but we can’t understand much when we peer inside. It’s not so much a thing we’ve architected as an impenetrable artifact that we can only test for effectiveness and safety, much the same as pharmaceutical products.
Great Tech Brings Great Hype: AI Is Not A Panacea
But there’s one thing that I find even more amazing than LLMs: the hype they’ve generated. Their capabilities are so seemingly humanlike as to inspire a prevalent belief that technological progress will shortly arrive at artificial general intelligence, computers capable of almost everything humans can do.
One cannot overstate the hypothetical ramifications of achieving AGI. Doing so would grant us technology that one could install the same way one onboards any new employee, releasing it into the enterprise to contribute autonomously. LLMs deliver a lot of value by generating computer code, summarizing data and performing other impressive tasks, but they’re a far cry from virtual humans.
Yet the far-fetched belief that AGI is nigh prevails and fuels AI hype. OpenAI optimistically declares AGI to be its mission. Its CEO, Sam Altman, recently wrote, “We are now confident we know how to build AGI as we have traditionally understood it. We believe that, in 2025, we may see the first AI agents ‘join the workforce’…”
AGI Is Nigh: A Baseless Claim
“Extraordinary claims require extraordinary evidence.”
–Carl Sagan
Given the audacity of the claim that we’re heading toward AGI – and the fact that such a claim could never be proven false – the burden of proof falls to the claimant, who must collect evidence as wide in scope as the claim itself. Until then, the claim is subject to Hitchens’s razor: “What can be asserted without evidence can also be dismissed without evidence.”
What evidence would suffice? Even the impressive emergence of unforeseen capabilities – such as LLMs’ ability to perform well on multiple-choice quizzes – must not be misinterpreted as conclusive evidence that technology is moving toward human-level performance in general. Instead, given how vast the range of human capabilities is, we could only gauge progress in that direction by measuring performance over a meaningful subset of such capabilities. For example, if validating AGI would require testing on a million varied tasks, perhaps we could establish progress in that direction by successfully testing on, say, a representative collection of 10,000 varied tasks.
Current benchmarks don’t make a dent. By claiming that we are witnessing progress toward AGI after only testing on a very narrow collection of tasks, we are to date greatly underestimating the range of tasks it would take to qualify as human-level. This holds even for standardized tests that screen humans for elite careers and status, since such tests were designed for humans, not machines. That an LLM can pass the bar exam is amazing, but the passing grade doesn’t necessarily reflect more broadly on the machine’s overall capabilities.
Pushing back against AI hype resonates with many – more than 787,000 have viewed my Big Think video saying generative AI is not going to run the world – but an exhilaration that borders on fanaticism dominates. The recent market correction may represent a sober step in the right direction, but let’s make a more complete, fully informed adjustment: It’s not only a question of our position in the LLM race – it’s a question of how much that race matters.