In today’s column, I examine the rising practice of employing large-sized generative AI and large language models (LLMs) to sharpen smaller-sized AI, known as small language models (SLMs). It is happening with increasing regularity. A new trend is underway.

This makes indubitable sense. Here’s why. Larger AI models readily contain a broad range of knowledge because they have the digital memory space available to hold it. Smaller AI models are usually tight on space and not as widely data-trained. If there are elements that we want a smaller AI model to have, and a larger model already contains them, a kind of transference can be undertaken, formally known as knowledge distillation, since you distill or pour from the bigger AI into the smaller AI.

Let’s talk about it.

This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).

Big Versus Small In AI Models

The AI world currently has two distinct pathways regarding generative AI: there are large language models (LLMs), and there are small language models (SLMs).

If you use any of the major generative AI apps and models such as ChatGPT, GPT-4o, o1, o3, Claude, Llama, Gemini, and the like, you are using what are considered large language models due to their massive scale. Tons and tons of online content were scanned to make those LLMs. The pattern-matching was immense and has been stored in a large-scale internal data structure within the AI.

One issue with LLMs is that they chew up a lot of digital memory space and they consume a slew of computing cycles to perform their processing. That’s why they are conventionally accessed in the cloud – they rely on expensive high-end computer servers and immense disk drives accordingly. The unfortunate downside is that you need an online connection to access them, and the usage can be costly if you are paying a fee for your various prompts and responses.

A means of solving this situation entails making use of small language models (see my discussion at the link here). They are often small enough that you can run them directly on your smartphone or laptop. No need to depend on an Internet connection. The cost of processing drops to near zero (assuming that the vendor doesn’t opt to apply some other hidden billing or charges).

The rub is this.

SLMs are not usually nearly as capable as LLMs.

You can’t expect something that holds only five pounds of gold nuggets to be as extensive as something that holds a hundred pounds of gold nuggets. An AI maker that establishes an SLM will typically aim to focus the SLM on some specific topic or domain. The SLM is considered narrow concerning the range of so-called knowledge that it holds. For more about how SLMs are able to squeeze a lot into a smaller framework, see my coverage at the link here.

Helping SLMs Via LLMs

Suppose that there is something in an LLM that we would greatly wish to have in an SLM.

For example, assume for the sake of discussion that we have in hand an SLM that lacks any semblance of info about what the stock market is all about. That wasn’t a topic covered during the initial data training of the SLM. In contrast, seemingly all LLMs would have encountered data about the nature of the stock market when the initial data training was scanning widely across the whole Internet. We would expect that an LLM would have plenty of data about stock market purposes and uses.

How can we leverage the LLM to aid the SLM in getting up-to-speed about the nature of the stock market?

A person who is a techie would likely right away think about doing some kind of internal data transfer from the guts of the LLM to the innards of the SLM. Maybe use an online tool to go ahead and copy or extract the stock market portions of the LLM. Then, use a similar tool to insert or copy into the SLM that particular extraction. Voila, problem solved.

Well, yes, that’s a potential means of doing a transfer from one model to another model. There are difficulties afoot. The odds are that the internal structures of the LLM are different from those of the SLM. In that sense, converting the extracted material into something amenable to insertion into the SLM can be problematic. I’m not saying it isn’t solvable, only that it can be a bear to deal with (costly, troublesome, and so on).

Maybe there’s another way to handle the transference.

Yes, indeed, we can do something much easier, namely lean into the fact that generative AI is based on the entering of prompts and the generating of responses. That is how these models are natively designed to work. We could perhaps prompt our way to getting the data out of an LLM and prompt our way to feeding that data into the SLM.

Prompts are wonderful as you will soon see.

Example Of LLM To SLM Transference

Let’s log into two AI systems: one is an LLM, and the other is an SLM.

We will proceed to use the LLM to essentially train or teach the SLM about the stock market. The LLM and SLM are to be directly connected to each other. They carry on a conversation. If you’ve ever used generative AI, you’ve undoubtedly engaged in a conversation whereby you enter a prompt, the AI responds, you enter another prompt, the AI responds, and so on. The LLM and SLM will do the same, though just with each other, and there isn’t a human in the loop.

Consider the LLM to be a teacher in this scenario, and the SLM to be a student.

Here we go.

  • LLM (Teacher AI): “This is an overview introduction to the stock market. The stock market is a complex system where shares of publicly traded companies are bought and sold. Do you know what a stock is?”
  • SLM (Student AI): “No. What is a stock?”
  • LLM (Teacher AI): “A stock represents ownership in a company. Owning a share means owning a piece of that company’s equity.”
  • SLM (Student AI): “Got it. Do prices of stock shares fluctuate or do they remain fixed in value?”
  • LLM (Teacher AI): “Prices of shares fluctuate based on supply and demand, influenced by factors like company performance, market trends, and investor sentiment.”

Take a close look at the snippet of the AI-to-AI conversation that I’ve provided above.

One crucial aspect is that this data transfer is unlike a conventional form of data transfer. In a techie-oriented data transfer, you would identify a set of data to extract and figure out a place in the targeted SLM to plop that data. None of that is happening in this alternative approach. Instead, the extraction is varied, taking place by conversation, and the insertion is varied, occurring based on the dialogue underway rather than a straight-ahead data transfer.

Pretty nifty.
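For readers who like to see the plumbing, here is a minimal sketch in Python of how such an AI-to-AI teaching loop might be wired up. Nothing here is a specific vendor’s method: the call_model helper is a stand-in for whatever chat API you actually use, and the model names, system instructions, and turn count are purely illustrative. You would connect call_model to your API of choice before running the loop.

```python
# Minimal sketch of prompt-based distillation: a teacher model converses with a
# student model, and the transcript is saved so the student can later be
# trained on it. ASSUMPTION: call_model() is a placeholder for whatever
# chat-completion API you actually use (a cloud LLM for the teacher, perhaps a
# local runtime for the SLM student); wire it up before running the loop.

def call_model(model: str, messages: list[dict]) -> str:
    raise NotImplementedError("Connect this to your chat API of choice.")

TEACHER = "large-model-placeholder"   # hypothetical LLM identifier
STUDENT = "small-model-placeholder"   # hypothetical SLM identifier

teacher_msgs = [{"role": "system",
                 "content": "You are teaching a student AI about the stock market. "
                            "Explain one concept per turn and end with a question."}]
student_msgs = [{"role": "system",
                 "content": "You are a student AI learning about the stock market. "
                            "Answer the teacher and ask clarifying questions."}]
dialogue = []

student_reply = "I'm ready to learn about the stock market."
for turn in range(10):  # the number of teaching turns is arbitrary here
    # The teacher sees the student's latest reply and produces the next lesson.
    teacher_msgs.append({"role": "user", "content": student_reply})
    teacher_reply = call_model(TEACHER, teacher_msgs)
    teacher_msgs.append({"role": "assistant", "content": teacher_reply})

    # The student sees the teacher's lesson and responds.
    student_msgs.append({"role": "user", "content": teacher_reply})
    student_reply = call_model(STUDENT, student_msgs)
    student_msgs.append({"role": "assistant", "content": student_reply})

    dialogue.append({"teacher": teacher_reply, "student": student_reply})

# The collected dialogue can then serve as training examples for the SLM,
# which is how the "insertion" side of the transfer typically happens.
```

In most prompt-based distillation pipelines, the saved transcript is then converted into fine-tuning examples for the student model, which is how the teaching gets durably stored rather than evaporating at the end of the chat session.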

AI Teacher And The AI Student

The beauty is that the data training is working at a higher level of consideration and not at the bits-and-bytes level. The LLM and SLM could carry on with this conversation until the LLM has told the SLM as much as it can about the stock market.

If that seems like a slow way to do things since conversations are a laborious task, remember that this dialogue is happening between two AI systems. The speed of this dialogue could be far faster than any human-to-AI dialogue. The pace would be limited only by how fast the two AIs can process prompts and responses.

In the AI field, this process of having two AI models undertake a data transference is referred to as knowledge distillation. We are distilling data, or some would say knowledge, from one AI model to another.

Does the direction always have to be from an LLM to an SLM?

Nope.

SLM As Teacher And LLM As Student

Sometimes we might want the SLM to be the teacher and the LLM to be the student. The circumstance usually goes like this. An SLM has been data trained on some niche or been given extensive data training on a narrow topic that isn’t fully covered in an LLM. Suppose we want the LLM to also have the same niche or depth as found in the SLM.

Easy-peasy, just do the same as mentioned previously but put the SLM in the driver’s seat.

Here’s an example.

  • SLM (Teacher AI): “Do you know what synthetic aperture radar (SAR) is?”
  • LLM (Student AI): “I only know of the acronym and not much more. Tell me about SAR.”
  • SLM (Teacher AI): “It’s a type of radar used to create detailed images, often from satellites or aircraft, by using the movement of the radar platform to simulate a larger antenna. SAR processes signals to achieve high-resolution imaging, regardless of weather or lighting conditions.”
  • LLM (Student AI): “Got it. How does SAR achieve detailed resolution?”
  • SLM (Teacher AI): “By combining multiple radar signals over time as the platform moves, it synthesizes a larger ‘virtual’ antenna for higher resolution.”

In this instance, the LLM knew of the topic but didn’t have any further details. The SLM was able to share with the LLM aspects that will now also be contained in the LLM.

Permutations And Combinations

I trust that you already envision these four possibilities:

  • (1) LLM-to-SLM. We do a distillation from a large language model to a small language model.
  • (2) SLM-to-LLM. We do a distillation from a small language model to a large language model.
  • (3) LLM-to-LLM. We do distillation from a large language model to another large language model.
  • (4) SLM-to-SLM. We do distillation from a small language model to another small language model.

You saw those first two possibilities of LLM-to-SLM and SLM-to-LLM in the examples noted above.

There are times when you might want an LLM-to-LLM distillation. Why might that be? Please be aware that the LLMs of different AI makers are data-trained on different portions of the Internet. Sure, there is a huge amount of overlap about what data they each scanned, but there are still some differences. It could be that one LLM covered aspects in its scan that another LLM would find handy to have.

The same can be said about the SLM-to-SLM distillation. There might be occasions wherein one SLM has something that we want to be shared with a different SLM.

Things Can Go Wrong

Many aspects can go wrong with this prompt-based approach to distillation.

Suppose that the model doing the teaching does a poor job of deciding what to share with the other model. In my example about the stock market, the LLM doing the teaching might fail to cover all the things that the SLM ought to be trained in. The dialogue could meander. It might omit important points.

Compared to a strict data-to-data internal transfer, which is relatively bounded and precise, this form of data transfer is open-ended, and a lot of issues can readily arise.

The student model could also mess up. It might fail to ask suitable questions. It might falter when ingesting the info being provided by the teacher model. The student’s interpretation of what the teacher model conveys could be miscast when it gets stored in the student model.

Not to seem like a Gloomy Gus, but the teacher model and the student model could each falter, doing so at various junctures of the distillation process. One moment, the teacher model goofs. The next moment, the student model goofs. Yikes, it could be somewhat akin to those old Abbott and Costello skits such as the classic “Who’s On First?”

Astute AI developers who use prompt-oriented distillation are aware of those challenges and can take various precautions to cope with them.
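By way of illustration, one simple precaution is to have the teacher work from an explicit checklist of subtopics and quiz the student after each one, so meandering, omissions, and misinterpretations can be caught. The sketch below reuses the hypothetical call_model helper from the earlier example; the syllabus contents and prompts are merely examples, not a prescribed technique.

```python
# One illustrative precaution: the teacher works through an explicit syllabus,
# and the student is quizzed after each subtopic so meandering, omissions, or
# misinterpretations can be caught and retaught. Reuses the hypothetical
# call_model(model, messages) placeholder from the earlier sketch.

SYLLABUS = ["what a stock is", "how share prices are set",
            "market indexes", "dividends", "basic risks of investing"]

def teach_with_checks(teacher: str, student: str) -> list[dict]:
    transcript = []
    for subtopic in SYLLABUS:
        lesson = call_model(teacher, [{"role": "user",
            "content": f"Teach a student AI about: {subtopic}. Be concise and accurate."}])
        quiz = call_model(teacher, [{"role": "user",
            "content": f"Write one question that tests understanding of: {subtopic}."}])
        answer = call_model(student, [{"role": "user",
            "content": f"{lesson}\n\nNow answer this question: {quiz}"}])
        verdict = call_model(teacher, [{"role": "user",
            "content": f"Question: {quiz}\nStudent answer: {answer}\n"
                       "Is the answer correct? Reply YES or NO with a brief reason."}])
        transcript.append({"subtopic": subtopic, "lesson": lesson,
                           "quiz": quiz, "answer": answer, "verdict": verdict})
        # A failing verdict could trigger a re-teach of the subtopic or a human review.
    return transcript
```

The verdicts double as a rough progress report on how well the distillation is going.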

Research On AI Knowledge Distillation

If you are intrigued by the emergence of AI knowledge distillation, there is a sizable amount of research on the evolving topic.

A handy place to start would be to read this recent survey and framework paper entitled “A Survey on Knowledge Distillation of Large Language Models” by Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou, arXiv, October 21, 2024, which made these salient points (excerpts):

  • “The concept of knowledge distillation in the field of AI and deep learning (DL) refers to the process of transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student).”
  • “The escalating need for a comprehensive survey on the knowledge distillation of LLMs stems from the rapidly evolving landscape of AI and the increasing complexity of these models.”
  • “The key to this modern approach lies in heuristic and carefully designed prompts, which are used to elicit specific knowledge or capabilities from the LLMs. These prompts are crafted to tap into the LLM’s understanding and capabilities in various domains, ranging from natural language understanding to more complex cognitive tasks like reasoning and problem-solving.”
  • “The use of prompts as a means of knowledge elicitation offers a more flexible and dynamic approach to distillation. It allows for a more targeted extraction of knowledge, focusing on specific skills or domains of interest. This method is particularly effective in harnessing the emergent abilities of LLMs, where the models exhibit capabilities beyond their explicit training objectives.”

The study does a helpful job of identifying the range of AI research in this discipline and provides plentiful references and citations for you to then dig further into the weighty matter.

Where Things Are Headed

As the number of generative AI models proliferates, the odds are that we will want to do more and more distillation among the models. Each model will inevitably be missing something that another model contains, and we will see great value in sharing across AI models. This raises some hefty AI ethics and AI legal questions (see my discussion at the link here), and doing distillation across variously owned models could be a legal quagmire.

Another twist is that there are opportunities to advance in this realm via aspects such as multi-teacher distillation and/or multi-student distillation. The upshot is this. We might have several LLMs that we want to simultaneously distill into, say, an SLM. Likewise, we might have a bunch of SLMs that we want to all be data trained at the same time, as sketched below.
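There is no single settled recipe here, but as a rough sketch under the same assumptions as the earlier examples (the call_model helper and all model identifiers are hypothetical), a multi-teacher setup might gather each teacher’s take on a topic and merge the takes into one lesson, while a multi-student setup broadcasts the same lesson to several students at once.

```python
# Rough sketch of multi-teacher and multi-student variants, again using the
# hypothetical call_model(model, messages) placeholder. All identifiers are
# made-up examples, not real model names.

TEACHERS = ["llm-alpha", "llm-beta", "llm-gamma"]   # several LLM teachers
STUDENTS = ["slm-one", "slm-two"]                   # several SLM students

def multi_teacher_lesson(topic: str) -> str:
    """Ask each teacher for its take on a topic, then have one teacher merge them."""
    takes = [call_model(t, [{"role": "user",
             "content": f"Explain {topic} for a student AI."}]) for t in TEACHERS]
    return call_model(TEACHERS[0], [{"role": "user",
             "content": "Merge these explanations into one consistent lesson:\n\n"
                        + "\n---\n".join(takes)}])

def broadcast_to_students(lesson: str) -> dict:
    """Send the same lesson to every student and collect their restatements."""
    return {s: call_model(s, [{"role": "user",
            "content": f"Study this lesson and restate it in your own words:\n\n{lesson}"}])
            for s in STUDENTS}
```

Which model does the merging, and how disagreements among teachers get resolved, remain very much open design questions.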

Exciting prospects.

I will give you a final thought for now that you are welcome to mindfully ponder. Albert Einstein famously said this: “It is the supreme art of the teacher to awaken joy in creative expression and knowledge.”

When a human teaches another human, the usual aim is more than just a knowledge transfer. The hope is that the teacher will inspire the student. The student will see the grand value in whatever is being taught. Should we expect the same when doing an AI-to-AI teacher-student endeavor?

Hogwash, some might exhort. AI is not sentient. It doesn’t need nor care about being inspired. This is entirely and solely about moving data from one pile to another. Period, end of story.

That might be the case at this time, but if we attain artificial general intelligence (AGI), or possibly even artificial superintelligence (ASI), will that still be the same? Compelling arguments suggest that these softer human-oriented facets might become more pronounced.

Well, there you go, I’ve sought to distill this topic sufficiently for you — may your learning in life be both highly informative and breathtakingly inspiring.
