In today’s column, I examine the sudden and dramatic surge of interest in a form of AI model architecture known as a mixture-of-experts (MoE). This useful generative AI and large language model (LLM) approach has been around for quite a while, and its formulation is relatively well-known. I will lay out for you the particulars so that you’ll have a solid notion of what mixture-of-experts is all about.
The reason MoE suddenly garnered its moment in the spotlight is the release of DeepSeek’s R1 model, which uses MoE extensively.
In case you haven’t been plugged into the latest AI revelations, DeepSeek is a China-based AI company that has made available a ChatGPT-like LLM. The company also made the bold claim that the AI was devised at a significantly reduced cost, causing quite a pronounced stir since the bulk of USA AI efforts assume that only vast and costly amounts of hardware can produce such a capable model. Hardware provider Nvidia took a big hit to its stock price, and most of the major USA AI firms also saw their stocks get punished.
If you are interested in whether we need to keep scaling up or whether maybe we need to work smarter when it comes to advancing AI, see my in-depth analysis at the link here.
I will focus my discussion here on the overall nature of the mixture-of-experts approach, including how it has both key advantages and disadvantages. That being said, please know that the DeepSeek AI system is said to have also leaned heavily into the use of knowledge distillation, which I’ve covered extensively at the link here. The idea is that you can rapidly bring a new AI system up to speed by using an existing fully capable one to train or transfer knowledge into the newer, lesser model. Some assert that if DeepSeek took that route, in a sense they were “cheating” in that they relied on someone else’s existing model to parlay their own instance into existence (see my discussion about the ethical and legal intellectual property or IP rights issues that arise, at the link here).
Another factor of their AI approach is that they extensively utilized reinforcement learning (RL) techniques, which I’ve covered at the link here.
The star of this show is mixture-of-experts or MoE.
Let’s talk about it.
This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).
Mixture-Of-Experts Hits The Jackpot
The mixture-of-experts approach traces its roots back to the early 1990s, and toward the end of this discussion I will share with you some fascinating quotes from a now-classic AI research paper that got the ball rolling on this innovation in 1991. By and large, the core facets are still true today, and those core facets are what I’m going to introduce to you.
To get things underway, I am going to provide a simplified version of MoE. I say this to try and avoid getting harangued by trolls who will be distraught that I won’t be portraying the mathematical and computational complexities. Those nitty-gritty details are certainly valuable, and indeed, I will provide references throughout where you can readily learn more on the matter.
Here’s the deal.
The most common way to devise generative AI consists of designing the internal structure or architecture as one large monolith. Any prompt that a user enters will essentially be routed throughout this massive byzantine structure and have zillions of touchpoints as it progresses and is processed. This is relatively easy to devise, comparatively easy to maintain, and generally works pretty well.
An alternative approach is to decide that we will divide up the monolith into respective parts or components. Each component will have a specific purpose. The AI parlance is to say that these components are “experts” and that we are dividing up the processing so that it occurs across a multitude of these experts.
I personally disfavor the use of the word “expert” to refer to the components since this tends to anthropomorphize the AI, as though it is a set of human experts. Each component is merely devoted to some area of interest and, I would say, is specialized in what it can suitably process. Anyway, the word “expert” is catchy and has become the default terminology in use.
The bottom line is that the mixture-of-experts approach consists of deciding beforehand that when you create a generative AI or LLM you will divide it up into some number of components that we’ll say are experts. Doing so has some exciting properties that can boost the AI in comparison to the run-of-the-mill monolithic route.
Handiness Of A Bundle Of Experts
The good news about mixture-of-experts is that you can hone the AI to be adept in specific areas or domains.
For example, suppose I wanted to ask generative AI some questions about various U.S. presidents, in particular, let’s say Abraham Lincoln. A monolithic AI structure would take my prompt and feed my question about Honest Abe to the entire landscape of the AI system being utilized. Here and there, various bits and pieces about Lincoln might be touched upon. Eventually, it is hopefully brought together by the AI and presented to me as a cohesive answer to my question about Lincoln.
In the case of MoE, we might have set up the LLM on the basis of “experts” that are components devoted to specific presidents. There is a component about Lincoln. A different component concentrates on George Washington. And so on.
My prompt asking about Abraham Lincoln would get routed to the Honest Abe component. This component derives a response to my prompt. The response is then presented to me.
Voila, easy-peasy.
Importance Of Gating Requests
Note that a crucial element of the MoE is that the prompt by the user must be carefully parsed by the AI to suitably identify which expert ought to be activated.
Suppose that I asked about Abraham Lincoln, but the AI sent my prompt to the component on George Washington. Sad face. Not where my prompt should have gone. Now then, in theory, the George Washington component would detect that the prompt is about Lincoln and then pass the request either back to the routing function or directly to the Lincoln component.
You might say that the gateway or gating of the mixture-of-experts is a make-or-break activity. When a prompt comes in, the gateway had better do a decent job of figuring out which expert should get the request. If the wrong expert is selected, this could be a waste of time and effort. Worse, the wrong expert might try to generate an answer. Imagine my surprise if the AI responded with details about George Washington rather than Honest Abe.
The gating has to be reliable; it has to be quick since otherwise time is unduly consumed, and it is also a kind of bottleneck. The positive aspect is that your prompt is likely to end up at the right place in the fastest time in comparison to the meandering monolith approach. The negative aspect is that your prompt might get misrouted and ultimately even produce a wrong or inappropriate response.
Just another tradeoff in the harsh world we live in. Seems like you can never truly eat your cake and have the icing too.
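To make the gating notion a bit more concrete, here’s a deliberately tiny Python sketch of a gateway picking an expert for a prompt. Everything in it is a made-up simplification on my part, including the keyword scoring, the expert names, and the fallback to a generalist component; a real LLM gateway is a learned neural network, not a keyword matcher.

```python
# A deliberately tiny, hypothetical sketch of MoE-style gating: route a prompt
# to the expert whose keywords best match it. Real gating networks are learned
# neural components, not keyword matchers.

EXPERTS = {
    "lincoln_expert": {"lincoln", "honest abe", "gettysburg", "emancipation"},
    "washington_expert": {"washington", "mount vernon", "continental army"},
}

def gate(prompt: str) -> str:
    """Pick the expert with the most keyword hits; fall back to a generalist."""
    text = prompt.lower()
    scores = {
        name: sum(keyword in text for keyword in keywords)
        for name, keywords in EXPERTS.items()
    }
    best_expert, best_score = max(scores.items(), key=lambda item: item[1])
    return best_expert if best_score > 0 else "generalist_component"

print(gate("Tell me about Abraham Lincoln and the Gettysburg Address."))  # lincoln_expert
print(gate("What noise does a faulty transmission make?"))                # generalist_component
```

If the wrong expert happened to score highest here, you’d get a George Washington answer to an Abraham Lincoln question, which is exactly the misrouting risk noted above.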
Illustrative AI Example
To showcase how this works, let’s assume that you are using generative AI to help diagnose what is wrong with your car. Your car has been making some oddball noises lately. There doesn’t seem to be any obvious basis for the noise.
You log into a generative AI app that is made via a monolith-oriented architecture. The AI has various automotive mechanic capabilities spread around in the overall structure, but there aren’t specific components devoted to, say, the engine, the transmission, etc.
Here we go.
- My entered prompt: “My car is making a strange noise. Can you fix it?”
- Generative AI (Car Mechanic #1): “I will take a look.”
- Generative AI (Car Mechanic #2): “I will take a look.”
- Generative AI (Car Mechanic #3): “I will take a look.”
As you can see, the AI is spreading your request across the landscape of the AI and merely generically seeking car mechanic portions.
You decide to log out of the AI and instead log into a mixture-of-experts or MoE-based model.
Here’s what happens.
- My entered prompt: “My car is making a strange noise. Can you fix it?”
- Generative AI (Car Diagnostician): “I will guide my team to look and diagnose what’s wrong with your car. My guess is that it might be an engine problem, so I’m going to ask my engine specialist or expert to examine the car first.”
- Generative AI (Engine Specialist): “I went ahead and looked at the engine. Your guess is correct, it is a problem with the engine. I can fix it if you want me to proceed.”
You can now see that the AI first used a gateway function that was labeled as a car diagnostician. This overall component decided that the specialized component or “expert” on engines would be the best place to route the request to. Sure enough, the engine expert said that the problem is with the engine.
Routing To Multiple Experts
Let’s use the same example and this time the engine expert says that the engine isn’t the culprit making the oddball noise.
- My entered prompt: “My car is making a strange noise. Can you fix it?”
- Generative AI (Car Diagnostician): “I will guide my team to look and diagnose what’s wrong with your car. My guess is that it might be an engine problem, so I’m going to ask my engine specialist to examine the car first.”
- Generative AI (Engine Specialist): “I went ahead and looked at the engine. The problem is definitely not the engine.”
- Generative AI (Car Diagnostician): “Thanks for letting me know that it isn’t the engine. I was mistaken in my initial guess. I think it might be the transmission, so I’ll have my transmission specialist look.”
- Generative AI (Transmission Specialist): “I’ve looked and discovered that it is a transmission problem. I can fix it if you want me to proceed.”
You can see that the engine expert reported back to the gateway. The gateway then opted to make another guess and routed the request to the transmission expert. Luckily, this was the right place to finally land.
What You See Or Don’t See
Depending upon how the MoE is shaped, sometimes you won’t realize that the various so-called experts within the AI are being consulted. There isn’t necessarily a need for you to know what took place during the processing of your prompt. All you probably care about is that you get a solid answer to your prompt.
Thus, another purpose of the gateway is often that it routes internally, collects what is being generated by the internal structure, and aggregates it into the final response that you ultimately see. Everything else is hidden from your view.
Here’s what that might be like.
- My entered prompt: “My car is making a strange noise. Can you fix it?”
- Generative AI (Car Diagnostician): “I will guide my team to look and diagnose what’s wrong with your car. I’ll get back to you in a moment.”
- Generative AI (Car Diagnostician): “Internal communique: Engine expert, please see if the engine is making a noise and let me know what you find.”
- Generative AI (Engine Specialist): “I went ahead and looked at the engine. I believe that the sparkplugs need to be changed. I’m not sure if doing so will completely resolve the issue but that’s my best guess.”
- Generative AI (Car Diagnostician): “Thanks for letting me know about your guess. Since you aren’t completely sure, I’ll ask the transmission specialist to weigh in too.”
- Generative AI (Transmission Specialist): “I’ve looked and discovered that the transmission fluid is low, and the car seems to be having trouble shifting gears accordingly. I think that might account for the noise.”
- Generative AI (Car Diagnostician): “Okay, I will tell the user that we are recommending then that sparkplugs be replaced and the transmission fluid be refilled. This hopefully will fix the noise problem.”
In this instance, the gateway carried on an internal conversation with the various AI experts. Once a resolution was devised, the gateway aggregated the results and was ready to present the recommendation to the user.
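If you’d like a rough feel for that hidden aggregation, here’s a hypothetical Python sketch in which the gateway quietly keeps an internal transcript of what the made-up experts report and hands the user only the combined recommendation. The experts, their canned findings, and the transcript format are all invented for illustration.

```python
# Hypothetical illustration: the gateway consults its "experts" internally and
# only the aggregated recommendation is surfaced to the user.

def engine_expert(issue: str) -> str:
    return "the sparkplugs probably need to be changed"

def transmission_expert(issue: str) -> str:
    return "the transmission fluid is low, which could explain the noise"

def gateway(user_prompt: str) -> str:
    internal_transcript = []   # the hidden internal communique
    findings = []
    for expert in (engine_expert, transmission_expert):
        finding = expert(user_prompt)
        internal_transcript.append(f"{expert.__name__} reported: {finding}")
        findings.append(finding)
    # The transcript stays behind the scenes; only the summary goes back out.
    return "Recommendation: " + "; ".join(findings) + "."

print(gateway("My car is making a strange noise. Can you fix it?"))
```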
Upfront Mighty Decisions
For an AI maker or developer, a big challenge when leaning into a mixture-of-experts approach is what the various experts or components will be focused on. You usually must decide this upfront. Once you’ve made that decision, the structure is somewhat locked into place.
Suppose I opted to go with components focused on USA presidents.
A user comes along and wants to ask about USA governors.
Oops, my LLM wasn’t structured on that basis. At that juncture, it could be that I have set up only some experts and allowed a semi-monolith for the rest of the structure, and I will then route the prompt to that monolith portion. You could argue that this is not too bad since that’s what would have happened in a pure monolith structure; on the other hand, the experts aren’t providing much advantage if that happens quite a lot.
Another notable design question is how many experts to have. Should I have experts per president, and then experts per vice president, or maybe just have experts that each cover a pair of a president and their associated vice president? This is a tough aspect to decide at the get-go.
Data Training Of The MoE
Generative AI is data trained by scanning the Internet and pattern-matching on human writing. The good news about MoE is that you can direct the scans toward the areas that pertain to the experts or components that you’ve decided upon. This can expedite data training.
A downside is that you need to figure out whether the components or experts are getting sufficient data training. With a monolith, you at times just let the data training run as long as you can afford to or until you run out of data for training purposes. A MoE is usually best undertaken by assessing how deep and sufficient each component’s training is during the initial data training.
The upshot is that data training requires a somewhat different perspective depending upon whether you are data training an underlying monolith LLM or a MoE model.
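As a loose illustration of that bookkeeping, the hypothetical snippet below partitions a labeled training corpus by topic and flags any would-be expert whose slice of data looks too thin. The topic labels, the threshold, and the toy documents are all invented for the example; real data training pipelines are vastly more elaborate.

```python
from collections import defaultdict

# Hypothetical labeled corpus: (topic, document) pairs destined for experts.
corpus = [
    ("lincoln", "Lincoln delivered the Gettysburg Address in 1863."),
    ("lincoln", "Lincoln signed the Emancipation Proclamation."),
    ("washington", "Washington presided over the Constitutional Convention."),
]

MIN_DOCS_PER_EXPERT = 2  # invented threshold for "sufficient" coverage

docs_per_expert = defaultdict(list)
for topic, document in corpus:
    docs_per_expert[topic].append(document)

for topic, documents in docs_per_expert.items():
    status = "ok" if len(documents) >= MIN_DOCS_PER_EXPERT else "needs more data"
    print(f"{topic}: {len(documents)} document(s) ({status})")
```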
More About Those Experts
Lots more twists and turns come into the picture.
Some advocate that MoE should have shared experts along with dedicated experts. This involves deciding that some specialized components can be drawn upon by any of the other experts, versus ones that are only to be used by, say, the gateway.
The beauty of having multiple experts or components is that you can potentially leverage parallelization during the processing of a prompt. Envision that I wrote a prompt asking about both Abraham Lincoln and George Washington. In a conventional monolith, this is clunkily floated around to figure out which bits and bytes have something relevant to generate. With MoE, the gateway could split the prompt into two pieces, routing one piece to the Lincoln expert and the other piece to the George Washington expert, having both components working simultaneously on my prompt. Nice.
Speed can be a great outcome of using MoE.
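Here’s a small, hypothetical Python sketch of that parallel fan-out, using a thread pool so two stand-in experts can work at the same time. The experts, their canned answers, and the hardcoded pretend “splitting” of the prompt are my own illustrative stand-ins for what a real gateway would do.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in experts; a real MoE would activate specialized subnetworks instead.
def lincoln_expert(sub_prompt: str) -> str:
    return "Lincoln: 16th president, led the Union through the Civil War."

def washington_expert(sub_prompt: str) -> str:
    return "Washington: 1st president, commanded the Continental Army."

def gateway(prompt: str) -> str:
    # Pretend "split": hand the relevant piece of the request to each expert
    # and let both work simultaneously in a thread pool.
    tasks = [(lincoln_expert, "Tell me about Abraham Lincoln."),
             (washington_expert, "Tell me about George Washington.")]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(expert, piece) for expert, piece in tasks]
        return "\n".join(future.result() for future in futures)

print(gateway("Compare Abraham Lincoln and George Washington."))
```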
A troubling aspect though can at times arise. The gateway might start to overuse some of the experts. Those particular experts become excessively relied upon. The AI gets sluggish. An important ongoing chore might be to monitor the spread of the usage and have some form of balancing algorithm to try and get things evened out if feasible.
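One simple way to picture that monitoring chore is the hypothetical counter below, which tallies how often each expert has been picked and flags lopsided routing. Real MoE training typically nudges the gating with a learned load-balancing mechanism rather than a hand-rolled counter, so treat this purely as a conceptual sketch.

```python
from collections import Counter

# Hypothetical routing log: which expert handled each recent prompt.
routing_log = ["engine", "engine", "engine", "engine", "transmission", "engine"]

usage = Counter(routing_log)
total = sum(usage.values())

for expert, count in usage.most_common():
    share = count / total
    flag = "  <-- possibly overloaded" if share > 0.5 else ""
    print(f"{expert}: {share:.0%} of prompts{flag}")
```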
Foundational AI Research
If you are intrigued by the MoE approach, you might enjoy the now-classic AI research paper that many credit with getting this line of architecture underway (please know that prior papers also postulated and examined the topic).
The paper is entitled “Adaptive Mixtures of Local Experts” by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton, Neural Computation, 1991, and made these salient points (excerpts):
- “We present a new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases.”
- “The new procedure can be viewed either as a modular version of a multilayer supervised network or as an associative version of competitive learning.”
- “If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different “expert” networks plus a gating network that decides which of the experts should be used for each training case.”
The emphasis is that generative AI and LLMs typically make use of artificial neural networks (ANN) as the underpinning of the AI, see my explanation about such neural networks at the link here.
You can create a humongous artificial neural network as one massive monolith or divide it into subnetworks. Those subnetworks are the components or experts of the MoE structure. As mentioned earlier, doing so can make processing faster; in AI parlance, you say that the “inferencing” is sped up. Furthermore, again in AI parlance, the usual feed-forward network (FFN) layers are essentially replaced with MoE layers. Each MoE layer usually makes use of one or more experts and has at least one gating function.
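To make that last point a bit more tangible, here’s a compact PyTorch-flavored sketch of a single MoE layer standing in for a feed-forward layer: a learned gate scores the experts, only the top-scoring few are activated for each input, and their outputs are blended using the gate’s weights. The layer sizes, the number of experts, and the top-k setting are arbitrary choices for illustration, not the configuration of DeepSeek’s R1 or any other particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sketch of an MoE layer that stands in for a standard feed-forward layer."""

    def __init__(self, d_model=64, d_hidden=256, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # the gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (batch, d_model)
        scores = F.softmax(self.gate(x), dim=-1)            # expert probabilities
        weights, chosen = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for idx, expert in enumerate(self.experts):
                mask = chosen[:, slot] == idx                # inputs routed to this expert
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoELayer()
tokens = torch.randn(8, 64)    # a small batch of token representations
print(layer(tokens).shape)     # torch.Size([8, 64])
```

Notice that only a couple of experts fire for any given input, which is why MoE models can hold a huge number of parameters overall yet keep the per-token computation relatively modest.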
Winner-Winner Chicken Dinner
DeepSeek’s AI makes use of mixture-of-experts, as do several other high-profile LLMs such as Mixtral by Mistral AI, NLLB-MoE by Meta, and others.
No one can say for sure whether MoE is the best or optimum way to devise generative AI. I’ve noted that it has upsides and downsides. It isn’t a silver bullet. The shock and awe of DeepSeek’s AI will certainly cause a rapid and wider pursuit of mixture-of-experts. Some will rush to include it. Others will be dubious and argue that the downsides outweigh the pluses.
All in all, we are in for exciting times.
Let’s give the last word for now to Aristotle: “Excellence is never an accident. It is always the result of high intention, sincere effort, and intelligent execution; it represents the wise choice of many alternatives — choice, not chance, determines your destiny.”
We will have to wait and see whether MoE is a choice, a chance, or a destiny.