There’s a new wrinkle in the saga of Chinese company DeepSeek’s recent announcement of a super-capable R1 model that combines high performance with low resource cost.

It shook up the U.S. stock market, and it’s still creating shock waves around the world. Just check out the latest episodes of the AI Daily Brief where host Nathaniel Whittemore reads headlines like “DeepSeek Shocks Silicon Valley” and “Meta AI Panic Mode”.

But the newest allegation is that DeepSeek actually used a particular process to put together its training data, and it’s one that some consider to be a little shady.

“They used OpenAI’s GPT-4 API at scale to generate responses for thousands or millions of prompts,” wrote Hannibal999 yesterday on X. “Automated API queries with multiple accounts to avoid detection and rate limits. Collected these responses as a dataset to fine-tune DeepSeek-R1, mimicking GPT-4’s instruction-following ability.”

That sounds sketchy, especially the part about avoiding rate limits, and the part about account-spamming.

Even the new U.S. president’s AI and crypto czar David Sacks is getting in on the action, saying in an interview with Fox News that there was “substantial evidence” that this kind of thing was going on.

“I think one of the things you’re going to see over the next few months is our leading AI companies taking steps to try and prevent distillation,” he said. “That would definitely slow down some of these copycat models.”

When you comb through these reports, there’s one word that keeps coming up again and again, and that’s “distillation.”

But even people fairly close to the industry may not know what this means. What is distillation, and why is it important?

The Teacher/Student Model

In the AI world, distillation refers to a transfer of knowledge from one model to another. I came across this piece on Medium that describes it in greater detail.

“Knowledge distillation refers to the process of transferring knowledge from a large model to a smaller one,” writes the author, Amit S. “This is vital because the larger knowledge capacity of bigger models may not be utilized to its full potential on its own. Even if a model only employs a small percentage of its knowledge capacity, evaluating it can be computationally expensive. Knowledge distillation is the process of moving knowledge from a large model to a smaller one while maintaining validity.”

So in many cases, distillation is done to transfer the refined results of a big model onto a smaller, more efficient one. That may not be exactly what's alleged in DeepSeek's case, where something different appears to be going on, but the technique can be very useful in, say, bringing robust AI to endpoint devices.

“During training, the loss function will not only consider the difference between the output of the distilled model and the ground truth labels but also the difference between the output of the distilled model and the output of the CNN for the same input,” Amit writes.
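To make that concrete, here's a rough PyTorch sketch of what a distillation loss like the one Amit describes might look like. This is a minimal illustration (the temperature and weighting values are placeholder assumptions of mine), not anyone's actual training code:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend a hard-label loss with a soft-label loss from the teacher."""
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # KL divergence between the softened student and teacher distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # standard rescaling from the original recipe
    return alpha * hard_loss + (1 - alpha) * soft_loss
```

The temperature softens both probability distributions so the student can learn from the teacher's "dark knowledge," the relative probabilities it assigns to wrong answers, not just its top pick.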

Uses of Distillation in Autonomous Vehicles

One of the prime examples of this technique is fitting sophisticated computer vision models into autonomous vehicles.

“Once the distilled model is trained, it can be deployed in self-driving cars, where it will require less computational resources and memory compared to the original CNN, while still maintaining a high level of accuracy in recognizing objects and traffic signs,” Amit explains.

To understand that, it's important to know that a convolutional neural network, or CNN, is an architecture designed specifically for computer vision tasks like object recognition.

Unlike other kinds of neural nets, the CNN has particular layers and operations that let the system process what surrounds it in a visual field. So transferring this knowledge to a more efficient model can be critically important for building self-driving systems that are safer and more effective.
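To give a sense of how small a deployable student can be, here's a toy student CNN; the layer sizes and the ten "traffic-sign classes" are purely hypothetical:

```python
import torch.nn as nn

# A deliberately tiny student network; layer sizes are made up.
student = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),  # e.g., ten hypothetical traffic-sign classes
)

# Only a few thousand parameters, versus millions in a typical teacher CNN.
print(sum(p.numel() for p in student.parameters()))
```

A network this size can run on modest in-vehicle hardware; distillation is what lets it inherit accuracy from a teacher far too big to ride along.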

Other Types of Distillation

The Medium post goes over various flavors of distillation, including response-based distillation, feature-based distillation and relation-based distillation. It also covers two fundamentally different modes of distillation – off-line and online distillation.

The difference comes down to when the teacher is trained: in offline distillation, a pre-trained teacher is frozen and the student learns from its outputs, while in online distillation the teacher and student are trained simultaneously. You can read all about it in the Medium piece, or elsewhere, where industry experts break down the various applications of the method.
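Here's a rough sketch of that difference, reusing the distillation_loss function from the earlier snippet; the teacher and student models and the data batch below are dummy placeholders:

```python
import torch
import torch.nn as nn

# Stand-in models; in practice the teacher would be a large pre-trained CNN.
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256),
                        nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 32),
                        nn.ReLU(), nn.Linear(32, 10))
x = torch.randn(8, 3, 32, 32)   # dummy image batch
y = torch.randint(0, 10, (8,))  # dummy ground-truth labels

# Offline: the teacher is already trained and frozen; only the student learns.
teacher.eval()
with torch.no_grad():
    t_logits = teacher(x)
offline_loss = distillation_loss(student(x), t_logits, y)

# Online: teacher and student train at the same time, each learning
# from the other's (detached) predictions.
t_logits, s_logits = teacher(x), student(x)
online_loss = (distillation_loss(s_logits, t_logits.detach(), y)
               + distillation_loss(t_logits, s_logits.detach(), y))
```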

Then there’s self-distillation, where a single model plays both roles: one part of the network (say, its deeper layers) acts as teacher for another part, so the model essentially learns from itself.
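A toy sketch of that idea, with an "early exit" head learning to match the same network's own deeper predictions (the architecture here is invented purely for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistillNet(nn.Module):
    """One network, two exits: the deep exit teaches the shallow one."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.shallow = nn.Sequential(nn.Flatten(),
                                     nn.Linear(3 * 32 * 32, 64), nn.ReLU())
        self.deep = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.early_head = nn.Linear(64, num_classes)  # "student" exit
        self.final_head = nn.Linear(64, num_classes)  # "teacher" exit

    def forward(self, x):
        h = self.shallow(x)
        return self.early_head(h), self.final_head(self.deep(h))

net = SelfDistillNet()
early, final = net(torch.randn(8, 3, 32, 32))
# The deeper head's detached output becomes a soft target for the shallow
# head, so the model learns from itself.
self_loss = F.kl_div(F.log_softmax(early, dim=-1),
                     F.softmax(final.detach(), dim=-1),
                     reduction="batchmean")
```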

At this point, it kind of sounds like we’re through the looking glass on how you would define distillation, since it’s supposed to be the transfer of knowledge from one model to another. You might call it AI virtualization? It also approaches the theory Marvin Minsky put forth in Society of Mind, which I wrote about yesterday: that a large intelligence is really a collection of smaller agents working together.

In any case, this term, distillation, is going to be useful because it gets to the heart of how we evaluate neural networks. What are the rules? Right now, the U.S. is trying to tighten export controls to keep Chinese firms from doing this sort of thing and making “imitations” of powerful LLM systems. What comes next?

Keep an eye out on this blog as I continue to cover what’s going on right now with AI.

Bottom line: the terminology is going to be useful in figuring out what’s going on in the global market and in the geopolitical race for control of AI.
