On June 16, the Chinese AI lab Z.ai released GLM-5.2. It is an open weights model under an MIT license, which means anyone can download it, modify it and run it commercially with no restriction. Its performance is incredibly impressive. It scores 81.0 on Terminal-Bench 2.1, which is one of the most commonly used model performance benchmarks. What’s worth noting is the rapid rate of improvement. The previous version of GLM, version 5.1, scored a mere 62 on the same benchmark. This is a serious jump and it’s been achieved in weeks, not years.
GLM-5.2 continues to dazzle with its performance on benchmarks. It scores 62.1 on SWE-bench Pro, which means that it edges past GPT-5.5. On FrontierSWE it trails the widely considered leader, Opus 4.8, by a mere point. This new Chinese model carries a one million token context window that holds up across long agentic sessions. And it costs roughly one sixth of what the leading American closed model charges per token.
Read that paragraph again. An open model you can run yourself now trades blows with the frontier on the tasks that matter most to engineers. At a sixth the cost.
China’s Winning Model
This is not an anomaly. This is China’s strategy. Real performance, fast iteration and improvement, and costs that will put a damper on every revenue projection sold to investors by US-based AI companies.
If my conclusion were supported merely by third-party benchmarks someone else ran, you would be justified in your skepticism. But I have been using Chinese models for hundreds of problems across many machines and a variety of workflows. My current favorite “jack of all trades” model is DeepSeek-V4, along with its cheaper v4-flash cousin. In my own work, outside of the horrendously expensive top Opus tier, it has been the most broadly capable model I use. V4-Pro is a 1.6 trillion parameter mixture of experts model that activates 49 billion parameters per token. It posts 80.6 percent on the SWE-bench Verified benchmark. It costs about 87 cents per million output tokens. That is roughly one thirtieth of frontier pricing. A smidgeon over 3%! The weights are open. You can do with it as you please. Please note that I am not describing a research curiosity. This is a model that does much of my real work.
Earlier I mentioned speed. Now let’s look at the cadence of this brave new open frontier. GLM-5 arrived in February. GLM-5.1 arrived in March and lifted the internal coding score from 35.4 to 45.3, a 28 percent jump in a single point release. GLM-5.2 arrived in June and lifted the Terminal-Bench result from 62 to 81.0, a jump of roughly 31 percent. Three steps. Four months. Each step was trained on Chinese silicon. There is still some argument about whether all of it was Nvidia-free, but I am inclined to believe that Chinese labs are now able to deliver frontier-class models on an entirely domestic stack.
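The size of each step is easy to check from the scores quoted above. A minimal sketch, using only the numbers in this article; the function and variable names are my own, for illustration.

```python
def percent_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Scores as quoted in the article.
coding_gain = percent_gain(35.4, 45.3)    # GLM-5 to GLM-5.1, internal coding score
terminal_gain = percent_gain(62.0, 81.0)  # GLM-5.1 to GLM-5.2, Terminal-Bench

print(f"Internal coding score gain: {coding_gain:.1f}%")
print(f"Terminal-Bench gain: {terminal_gain:.1f}%")
```

Both releases land at around a 30 percent relative gain on their respective benchmarks.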
All this speed means that the open frontier is not crawling toward the closed frontier. It is sprinting. In 2023 open models were two years behind. In 2024 one year. In 2025 six months. Today the gap on the benchmarks that decide real engineering work is measured in mere weeks.
Cost Curves And The Price Of Intelligence
Compare this to the cost of intelligence itself. For three years the price of a unit of model output fell roughly tenfold each year. A GPT-4 class result that cost twenty dollars per million tokens in late 2022 costs around forty cents today, a fiftyfold drop on that pair of prices alone. Compounded at tenfold a year over three years, the trend is close to a thousandfold decline. It is one of the fastest cost collapses in computing.
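The arithmetic behind the cost curve fits in a few lines. A minimal sketch, using only the figures quoted above: tenfold per year for three years, and the twenty dollar and forty cent price points. The constant and variable names are illustrative.

```python
YEARS = 3
ANNUAL_FACTOR = 10.0  # "roughly tenfold each year", from the text

# Tenfold a year, compounded over three years.
compounded_decline = ANNUAL_FACTOR ** YEARS

# The two price points quoted in the text, in dollars per million tokens.
price_then = 20.00
price_now = 0.40
observed_decline = price_then / price_now

# Annual factor implied by those two price points alone.
implied_annual_factor = observed_decline ** (1 / YEARS)

print(f"Compounded trend: {compounded_decline:.0f}x")
print(f"Quoted price points: {observed_decline:.0f}x")
print(f"Implied annual factor: {implied_annual_factor:.2f}x per year")
```

The compounded trend gives the thousandfold figure. The two quoted price points on their own give a fiftyfold drop, or a little under fourfold a year.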
But that curve stalled this year. Not because the technology stopped improving. But because of the supply chain. With the Iran war and the AI datacenter boom, the world ran short of memory. DRAM and high bandwidth memory went into acute shortage. Supplier inventories fell from months of stock to weeks. Server memory prices are on track to double by the end of 2026. The per token price kept drifting down while the cost to own or rent the hardware underneath it climbed. The deflation paused for a supply reason. Not a technology, physics or demand reason.
Fear The Coming Surprise
But this pause is not permanent. What happens when the dam breaks? Here is how two surprises landing at once can spell doom for many optimistic investors in massive datacenters.
The first surprise is new capacity coming online. The memory shortage is a cycle, not a ceiling. Fabs are being built. When that supply materializes, hardware costs fall back toward trend and the thousandfold curve picks up where it left off. Intelligence resumes getting cheaper on schedule, and the pause looks in hindsight like a single bad period on an otherwise steep slope.
The second surprise is the advent of the edge. While the cloud waits on memory, the desktop can quietly cross an important performance threshold. Nvidia now ships DGX Spark, a Grace Blackwell machine with 128 gigabytes of unified memory that runs models up to 200 billion parameters at four bit precision, for about 4,700 dollars. Link two of them and you have 256 gigabytes. Open weight models in the right size class already run on it. The software stack that supports all this (distributed inference, fast interconnects, model and machine management) has matured in months. Quite literally, a box that fits next to a monitor now does work that required a rack of rented accelerators two years ago.
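Why 200 billion parameters at four bit precision fits in 128 gigabytes comes down to simple arithmetic. A rough sketch, counting the weights only; it ignores the KV cache, activations and operating system overhead, all of which eat into the headroom in practice. The function name and the 16-bit comparison are my own additions for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


UNIFIED_MEMORY_GB = 128  # one DGX Spark, as quoted in the text

weights_4bit_gb = weight_memory_gb(200, 4)    # 200B parameters at four bits
weights_16bit_gb = weight_memory_gb(200, 16)  # same model at 16 bits, for contrast

fits_4bit = weights_4bit_gb <= UNIFIED_MEMORY_GB
fits_16bit = weights_16bit_gb <= UNIFIED_MEMORY_GB

print(f"4-bit weights: {weights_4bit_gb:.0f} GB, fits: {fits_4bit}")
print(f"16-bit weights: {weights_16bit_gb:.0f} GB, fits: {fits_16bit}")
```

Quantization is what makes the desk-sized box viable: the same model at 16 bits would need several linked machines for the weights alone.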
Put the two together. Frontier grade open models. A cost curve about to resume its fall. Consumer hardware that can host real models locally. Within three to four years the most capable model most people touch every day will not live in someone else’s data center. It will live on a machine they own. Cloud models may be more powerful at the margins, but that difference will be made up by unlimited run time, local network and document access, privacy and much else.
Will The Datacenter Bet Survive?
If excellent models run locally, this can become a problem for one specific bet. The bet is that demand for centralized inference will grow fast enough, for long enough, to justify hardware depreciated over five and six year schedules.
Michael Burry has made the accounting case loudly. Hyperscalers write down Nvidia silicon over five or six years while the real economic life of a chip is closer to two or three. He puts the understated depreciation across the industry at roughly 176 billion dollars through 2028. Goldman frames the same risk plainly. A fifty thousand dollar accelerator on a five year schedule carries ten thousand dollars a year in depreciation. If a new generation makes it uneconomic to run in year two, the operator still carries an asset that no longer earns. Multiply that across hundreds of thousands of units.
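The Goldman example is straight-line depreciation, and the stranded-asset problem falls out of it directly. A minimal sketch, assuming zero salvage value and, for the comparison, a three year economic life taken from the upper end of the "two or three" years above. The function names are illustrative.

```python
def annual_depreciation(cost: float, schedule_years: int) -> float:
    """Straight-line depreciation per year, assuming zero salvage value."""
    return cost / schedule_years


def book_value(cost: float, schedule_years: int, years_elapsed: int) -> float:
    """Value still carried on the books after `years_elapsed` years."""
    remaining = cost - annual_depreciation(cost, schedule_years) * years_elapsed
    return max(remaining, 0.0)


ACCELERATOR_COST = 50_000  # dollars, from the text

annual_on_books = annual_depreciation(ACCELERATOR_COST, 5)  # five year schedule
stranded_value = book_value(ACCELERATOR_COST, 5, 2)         # uneconomic in year two
annual_economic = annual_depreciation(ACCELERATOR_COST, 3)  # assumed three year life
understated_per_year = annual_economic - annual_on_books

print(f"Depreciation on a five year schedule: ${annual_on_books:,.0f} per year")
print(f"Still on the books after two years: ${stranded_value:,.0f}")
print(f"Understated per unit per year vs a three year life: ${understated_per_year:,.0f}")
```

Under these assumptions, an accelerator that stops earning after two years still carries thirty thousand dollars of book value.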
The first lease renewal cliff for the 2023 and 2024 build out hits late this year and next. Roughly half the data centers planned in the United States for 2026 already face delay or cancellation. A town in Wisconsin just passed the first voter referendum requiring approval for large data center incentives. Prediction markets put the odds of a federal moratorium before 2027 at about one in three.
Now consider the demand. If open models keep closing the gap, and if the cheapest place to run them becomes the device on your desk, the centralized inference demand curve that underwrites a five year depreciation schedule does not need to collapse. It only needs to grow slower than the spreadsheets shared with investors assumed. That is enough for major trouble to ensue.
Arguably, even a move to the edge and to local models may not hurt Nvidia. The company sells the accelerators in the data center and it sells the silicon at the edge. DGX Spark is Nvidia. The chips in the next generation of workstations and consumer cards are Nvidia. If inference migrates from the rack to the desk, Nvidia simply follows the workload. The risk does not sit with the company selling shovels. It sits with the operators who borrowed against a single mine and wrote a six year schedule for a two year asset.
Scary, But Good
There is a final reason this potentially bubble-popping shift is not just likely but also good.
Every time you send a prompt to a hosted model you hand over information. Not only the question. The context. The document you pasted. The code base you are debugging. The deal you are modeling. The diagnosis you are worried about. The strategy you have told no one. We are pouring the most sensitive material of our professional and personal lives into systems we do not control, governed by terms we do not write, subject to retention and access rules that change without our consent. Recent export actions that cut foreign users off from specific models overnight should end any illusion that the hosted relationship is stable.
The DoD has just disclosed that Grok’s models were used in military action against Iran. Imagine a user in a particular country sharing information about their house, location, office and street, only to have it used as training data that could one day cause that street to be bombed. It’s a horrendous thought, but many people now see these risks clear as day.
The only environment you can fully trust is the one you own. A model running on your machine, on your weights, behind your firewall, leaks nothing. The prompt never leaves the building. For a clinician, a lawyer, an intelligence operative, a weaponeer, an engineer on restricted work, a founder guarding a thesis, that is not a nice to have. It is the whole game.
The technology is converging on local. The economics are converging on local. And the question of trust was always going to converge on local, because the most private thinking you do all day should not require a stranger’s server to complete.
Data centers will continue to have their time in the sun. But I’d rather bet on open models at the edge. Build accordingly.











