Xingjian “XJ” Zhang, Head of Growth at Apex.AI, explores the impact of Vision-Language Models on the future of autonomous driving.
As I highlighted in my last article, two decades after the DARPA Grand Challenge, the autonomous vehicle (AV) industry is still waiting for breakthroughs—particularly in addressing the “long tail problem,” or the endless array of rare, unforeseen scenarios that vehicles must handle.
Even the most routine trip can present unexpected challenges: sudden weather shifts, unplanned roadwork or erratic pedestrian behavior.
Today’s AVs rely heavily on high-definition maps, meticulously labeled datasets and rigid rule-based logic. They perform well in structured environments but struggle with unexpected situations. Think of them as well-rehearsed stage actors—flawless when following a script but lost when asked to improvise.
Expanding an AV system into a new operational design domain (ODD) requires continuous mapping updates, additional data labeling and extensive re-engineering—making the process costly and time-consuming.
New Frontier: Vision-Language Models
The emergence of vision-language models (VLMs) offers a promising new approach. VLMs integrate computer vision (CV) and natural language processing (NLP), enabling AVs to interpret multimodal data by linking visual inputs with textual descriptions.
A growing number of projects are now leveraging VLMs for autonomous driving. One notable example is DriveVLM, a project by Li Auto and Tsinghua University. It employs a vision transformer encoder alongside a large language model (LLM). The vision encoder converts images into tokens, which an attention-based extractor then aligns with the LLM. This approach enables the system to generate a detailed linguistic description of the environment—weather conditions, road and lane attributes and critical objects, including in rare long-tail scenarios.
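To make that pattern concrete, here is a minimal PyTorch-style sketch of the general vision-encoder-plus-LLM recipe. It is not DriveVLM's actual code: the module names are hypothetical, and the attention-based extractor is simplified to a single linear projection.

```python
# Conceptual sketch of the vision-encoder + LLM recipe described above.
# NOT DriveVLM's actual code: module names are hypothetical, and the
# attention-based extractor is simplified to a linear projection.
import torch
import torch.nn as nn

class VisionLanguageDriver(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g., a ViT that turns camera frames into patch tokens
        self.projector = nn.Linear(vision_dim, llm_dim)  # aligns visual tokens with the LLM's embedding space
        self.language_model = language_model   # a pre-trained LLM, assumed to accept token embeddings

    def describe_scene(self, camera_frames, prompt_embeddings):
        visual_tokens = self.vision_encoder(camera_frames)   # (batch, num_tokens, vision_dim)
        aligned_tokens = self.projector(visual_tokens)       # (batch, num_tokens, llm_dim)
        # Prepend the visual tokens to the text prompt and let the LLM generate
        # a description of weather, lane attributes and critical objects.
        llm_input = torch.cat([aligned_tokens, prompt_embeddings], dim=1)
        return self.language_model(llm_input)
```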
VLMs improve the generalization of AV systems by leveraging pre-training on large-scale internet data, which gives them a foundational understanding of the world. That foundation strengthens scene understanding and planning, allowing AVs to navigate complex environments more reliably.
Advancing Further: End-To-End VLMs
While VLMs are gaining traction, AV architectures are also undergoing a transformation from modular systems to end-to-end (E2E) designs.
Traditional AVs compartmentalize perception, prediction and planning into separate modules. In contrast, end-to-end architectures unify these steps, processing raw sensor inputs to directly output driving actions, thus benefiting from joint feature optimization across perception and planning.
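The contrast can be sketched in a few lines of Python. The module names and interfaces below are hypothetical, meant only to illustrate the two designs.

```python
# Schematic contrast between a modular pipeline and an end-to-end model.
# Module names and interfaces are hypothetical, for illustration only.
import torch.nn as nn

# Modular: each stage is trained and tuned separately, and errors can compound
# as hand-designed outputs (object lists, predicted tracks) pass between stages.
def modular_drive(sensors, perception, prediction, planning):
    objects = perception(sensors)        # detect and classify objects
    futures = prediction(objects)        # forecast their motion
    return planning(objects, futures)    # plan a trajectory from intermediate outputs

# End-to-end: one network maps raw sensor data directly to a trajectory,
# so perception and planning features are optimized jointly by a single loss.
class EndToEndDriver(nn.Module):
    def __init__(self, backbone, trajectory_head):
        super().__init__()
        self.backbone = backbone                 # shared encoder over raw sensor inputs
        self.trajectory_head = trajectory_head   # decodes future waypoints

    def forward(self, raw_sensor_inputs):
        features = self.backbone(raw_sensor_inputs)
        return self.trajectory_head(features)    # e.g., a sequence of (x, y) waypoints
```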
Deploying VLMs within end-to-end AV systems could be a game-changer. A prime example is Waymo’s End-to-End Multimodal Model for Autonomous Driving (EMMA), a VLM-based system that integrates perception and planning into a single framework. EMMA has achieved state-of-the-art performance on benchmark datasets such as nuScenes and the Waymo Open Motion Dataset (WOMD).
Unlike modular architectures, EMMA directly processes raw camera images and high-level driving commands to generate driving outputs—including trajectory planning, object detection and road graph estimation. This unified approach reduces error accumulation across independent modules while leveraging the extensive world knowledge embedded in pre-trained LLMs.
EMMA also employs self-supervised learning, similar to next-token prediction in LLMs, to anticipate traffic patterns. By iterating through multiple future motion scenarios, EMMA uncovers nuanced driving behaviors that conventional models might overlook. Furthermore, it enhances decision-making through chain-of-thought (CoT) reasoning, allowing the model to generate explicit justifications for its driving choices—improving both safety and interpretability.
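As an illustration of the idea (not EMMA's actual prompt or output format), a driving decision framed as text generation with a chain-of-thought rationale might look like this:

```python
# Illustrative only: a driving task framed as text generation with
# chain-of-thought reasoning. This mirrors the concept described above,
# not EMMA's actual prompt or output format.

prompt = (
    "You are driving. Command: turn right at the next intersection.\n"
    "Scene: <camera tokens>\n"
    "First explain your reasoning, then output a trajectory."
)

# A model following this pattern might generate something like:
example_output = (
    "Reasoning: A pedestrian is waiting at the crosswalk on the right; "
    "I should yield before turning.\n"
    "Trajectory: (0.0, 0.0), (1.2, 0.1), (2.1, 0.5), (2.8, 1.4)"  # future waypoints as plain text
)

# The written rationale makes the decision auditable, and the waypoints are
# decoded token by token, just like next-token prediction in an ordinary LLM.
```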
VLMs In Cars: Why Not Yet?
Despite their promise, VLMs face significant challenges in real-world deployment.
Image-generation models such as DALL-E or Midjourney work with static images; autonomous driving, by contrast, requires processing continuous, high-dimensional video streams in real time. Capturing long-term spatial relationships in complex traffic environments demands advanced 3D scene understanding, a challenge that remains largely unsolved. For instance, Waymo’s EMMA is constrained to camera-only inputs, lacking fusion with 3D sensing modalities like LiDAR.
Another critical issue is inference latency. In AVs, every millisecond matters. Consider DriveVLM, which was tested on an NVIDIA Orin X and uses a four-billion-parameter Qwen model (by comparison, GPT-3 has 175 billion parameters and the largest Llama 3.1 variant has 405 billion). Even at this modest size, it exhibits a prefill latency of 0.57 seconds and a decode latency of 1.33 seconds, meaning the system needs about 1.9 seconds to process a single scene. At 50 mph, the vehicle would travel about 139 feet (42 meters) before it could react, an unacceptably long delay in critical situations.
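The arithmetic behind that figure is straightforward:

```python
# Back-of-the-envelope check of the reaction-distance figure above.
mph_to_mps = 0.44704             # 1 mph in meters per second
speed_mps = 50 * mph_to_mps      # 50 mph ≈ 22.35 m/s
latency_s = 0.57 + 1.33          # prefill + decode ≈ 1.9 s
distance_m = speed_mps * latency_s
print(f"{distance_m:.0f} m (~{distance_m * 3.281:.0f} ft)")  # ≈ 42 m (~139 ft)
```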
While today’s VLMs need time before they hit the road, the pace of innovation is relentless. I believe future breakthroughs in model distillation will enable VLMs to become more efficient without compromising intelligence, while advancements in edge computing will significantly reduce inference latency. These developments will allow VLM-powered autonomous vehicles to process multimodal information in real time, bringing on-the-fly decision making closer to reality.
For years, AVs have struggled with the complexity of the real world. I am hopeful that VLMs will finally bring “vision,” “language” and “model” to the industry: a new vision that sees beyond the data it was trained on; a new language that interprets the chaos of rare scenarios; and a new model not just for driving, but for learning, reasoning and communicating with humans.