The Rise Of The Multimodal LLM

There’s a new bit of jargon in the AI world, and it’s more than just a detail. It adds one letter, an M for “multimodal,” to the familiar acronym LLM, and although that may sound trivial, catching up on it might feel a little like déjà vu.

Do a quick conventional search for “LLMM.” You won’t come up with much, unless you check the AI overviews, where Gemini in Google or Copilot in Bing will tell you what it is.

“MLLM” does a bit better: you might find a result from IBM, some academic papers, and a page on GitHub. But the idea of the Multimodal Large Language Model, or to some, the Large Language Multimodal Model, hasn’t really made it into mainstream outlets like CNBC or Newsweek. It’s still the province of the true tech geek, for now.

What is a Multimodal Large Language Model?

The essential concept of a Multimodal Large Language Model is that it works on different kinds of data, with the implication that specific design choices make this possible. PhD researcher and engineer Sebastian Raschka defines the MLLM this way on a self-published platform:

“Multimodal LLMs are large language models capable of processing multiple types of inputs, where each ‘modality’ refers to a specific type of data—such as text (like in traditional LLMs), sound, images, videos, and more.”

If you assume that the machines do this by attaining something like a sophisticated form of distillation, you’d be right. But there’s another component to this, too. In some ways, it sounds like engineers are going back to the well of using classical ML techniques to enhance what an LLM, as a central “brain,” can do.

This starts with attaching sensor tools to the LLM itself, to bring that multimodal data in.

“Recent research shows that Multimodal Large Language Models (MLLMs) can be enhanced with sensory gear (e.g., IoT sensors, wearables, cameras) by using visual prompting to ground them in real-world sensor data,” explains a summary of a paper called “By My Eyes” that’s pioneering this kind of research, whose authors write:

“We design a visual prompt that directs MLLMs to utilize visualized sensor data alongside the target sensory task descriptions. Additionally, we introduce a visualization generator that automates the creation of optimal visualizations tailored to a given sensory task, eliminating the need for prior task-specific knowledge.”
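To make the idea concrete, here is a minimal sketch of what “visualized sensor data alongside the target sensory task descriptions” could look like. It is not the paper’s method: the function names, the message layout and the accelerometer values are all hypothetical, and a real system would send an encoded image of a proper plot rather than a grid of pixels.

```python
def rasterize(series, height=8):
    """Turn a 1-D sensor series into a binary line plot (rows of 0/1 pixels)."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
    grid = [[0] * len(series) for _ in range(height)]
    for x, value in enumerate(series):
        # Scale the reading to a row index; row 0 is the top of the plot.
        y = round((value - lo) / span * (height - 1))
        grid[height - 1 - y][x] = 1
    return grid


def build_visual_prompt(task_description, plot):
    """Pair the plot with the task text (hypothetical message layout)."""
    return {
        "image": plot,  # assumption: a real system would send an encoded image
        "text": "The image plots sensor readings over time. Task: " + task_description,
    }


# Made-up accelerometer magnitudes, for illustration only.
accel = [0.0, 0.2, 0.9, 0.3, 0.0, 0.1, 1.0, 0.2]
prompt = build_visual_prompt("Is the wearer walking or standing still?", rasterize(accel))

# Print the plot as text so the "image" can be inspected in a terminal.
for row in prompt["image"]:
    print("".join("#" if px else "." for px in row))
```

The point of the design is that the model never sees raw numbers. It sees a picture of them, plus a sentence saying what to look for.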

The Art of Imitation

If the traditional token-based LLM approach imitated human writing by scouring the internet and applying prediction models, the new MLLM/LLMM system is able to, in a sense, learn by seeing. It’s not limited to text as an input, or an output. And it’s interactive.

“From a Human Computer Interaction (HCI) and Human Augmentation (HA) perspective, MLLMs also offer various opportunities,” writes Jun Rekimoto in an article maintained at the Association for Computing Machinery’s Digital Library. “If such models can recognize the world in ways similar to humans, a range of applications becomes possible. These include technologies that can record and understand skilled human actions for transfer to others, assess skill development, recognize real-world behaviors to provide personalized assistance and assist individuals with disabilities by augmenting their sensory perception of the environment.”

That said, there’s a lot that MLLMs can do that bypasses traditional inference. That’s especially true when it comes to real-world tasks involving physics. The developer world pondered, for about a year, how to teach LLMs about physics through text, and then the world realized that you could just equip the LLM to see, and teach it that way.

Terms from the Aughts

Take the term “feature extraction.”

A model, perhaps a convolutional neural network (CNN), can look at an image, analyze it, and extract features to classify and identify what’s in view. Now, you can attach that CNN to an LLM, which then processes what the CNN sees and identifies. That’s a powerful combination, and it’s feeding a good deal of research into this kind of build.
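Here is a rough sketch of that pairing, under one common assumption: the vision model’s features are passed through a small linear “projector” so they match the size of the LLM’s token embeddings, then placed in the same sequence as the text. The image, kernel, weights and embedding values below are toy numbers chosen for illustration, and the function names are hypothetical.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


def project(features, weights):
    """Linear projector: map each feature vector to the LLM embedding size."""
    return [[sum(f * w for f, w in zip(vec, col)) for col in zip(*weights)]
            for vec in features]


# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to left-to-right brightness jumps

feature_map = conv2d(image, edge_kernel)  # 3 rows of 3 features each

# Assumption: each feature-map row becomes one "visual token" of size 2.
projector_weights = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
visual_tokens = project(feature_map, projector_weights)

# Made-up text embeddings standing in for the words of a prompt.
text_tokens = [[0.1, 0.3], [0.7, 0.2]]

# The LLM would then attend over visual and text tokens together.
sequence = visual_tokens + text_tokens
print(len(sequence), "tokens of size", len(sequence[0]))
```

In a real build the CNN (or another vision encoder) and the projector are trained, not hand-written, but the flow of data is the same: pixels, then features, then tokens the language model can read.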

Suppose you have a ball bouncing through a room and you want the LLM to “follow the ball.” How do you encode all of that information into the neural net? How do you “show” the model what the ball’s trajectory is like based on real-world physics?

Well, it’s a lot easier if the LLM can see.

Some experts also point out that LLMs equipped this way can pick up relational information from the start, eliminating repetitive querying. Some sources estimate that these models can cut floating-point operations (FLOPs) by up to 75%.
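The sources don’t spell out how that figure is derived. As a rough illustration only, and assuming the savings come from feeding the model fewer tokens (an assumption, not something the sources state), here is a back-of-envelope estimate using a standard approximation for the forward pass of one transformer layer. The function name and the model sizes are made up for the example.

```python
def layer_flops(n_tokens, d_model):
    """Approximate forward FLOPs for one standard transformer layer.

    Assumes an MLP hidden size of 4 * d_model and counts each
    multiply-add as 2 FLOPs.
    """
    linear = 24 * n_tokens * d_model ** 2    # Q/K/V, output projection, MLP
    attention = 4 * n_tokens ** 2 * d_model  # score matrix and weighted sum
    return linear + attention


full = layer_flops(n_tokens=2048, d_model=4096)
reduced = layer_flops(n_tokens=512, d_model=4096)  # a quarter of the tokens
saving = 1 - reduced / full
print(f"Estimated per-layer saving: {saving:.1%}")
```

Under this approximation, cutting the token count to a quarter removes a bit more than three quarters of the per-layer compute, because the linear terms shrink fourfold and the attention term shrinks sixteenfold.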

More Techniques

Within the realm of MLLM design, there’s more jargon emerging. For example, there’s the idea of token sparsification or compression. Here’s an explanation from a page on GitHub:

“Token compression reduces the number of visual tokens processed by MLLMs while preserving critical cross-modal semantics, enabling more efficient training and faster inference without large accuracy regressions. The field is fragmented across encoders, projectors, and LLM-side techniques; a centralized, searchable resource is needed.”
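One simple flavor of this, sketched below, is to score each visual token and keep only the most important ones. This is an illustrative pruning rule, not the method of any particular project: the function name and the scores are hypothetical, and in practice the scores might come from something like attention weights.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring visual tokens, preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k largest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort the kept indices so spatial order is not scrambled.
    return [tokens[i] for i in sorted(top)]


# Eight image patches with made-up importance scores.
patches = [f"patch_{i}" for i in range(8)]
importance = [0.02, 0.40, 0.05, 0.01, 0.25, 0.04, 0.03, 0.20]

kept = prune_visual_tokens(patches, importance, keep_ratio=0.25)
print(kept)
```

Other approaches merge similar tokens instead of dropping them, but the goal is the same: hand the LLM fewer visual tokens without losing what matters.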

Then there’s structural pruning and knowledge distillation (described in a paper on the topic), in which similar goals apply. Engineers are finding many ways to increase the efficiency of these models. As for attention mechanisms, there’s a lot of work being done on those, too, but maybe that’s another article.
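For a sense of what knowledge distillation involves, here is a minimal sketch of the classic soft-label loss, in which a small “student” model is trained to match the softened output distribution of a larger “teacher.” It covers distillation only, not structural pruning, and it is the textbook formulation rather than the specific recipe of any MLLM paper. The logits are made-up numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature that softens the output."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]  # made-up logits from a large model
student = [2.5, 1.5, 0.2]  # made-up logits from a smaller model

print(distillation_loss(student, teacher))
```

The loss is zero when the student’s logits match the teacher’s and grows as the two distributions drift apart, which is what gives training something to minimize.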

So although it may look a little like Roman numerals, the MLLM, as a descendant of the LLM, has a lot of potential. You may indeed hear a lot more about it, this year and in the years to come.
