World Models

In Short

A world model is a learned internal representation of an environment that an agent can query and simulate, enabling planning and reasoning without constant real-world trial and error. World models are a core concept in model-based reinforcement learning and increasingly central to arguments about how to build genuinely intelligent, grounded AI systems.

01. What It Is

A world model is a system's internal, learned approximation of how its environment works. Given a current state and an action, the model predicts what the next state will be. The agent can then use this model to mentally simulate candidate action sequences, evaluate their consequences, and choose actions without needing to physically execute each possibility first.

The term entered machine learning through Juergen Schmidhuber's 1990 work on recurrent neural networks that predict future states. David Ha and Schmidhuber's 2018 paper "World Models" (arXiv:1803.10122) brought the concept to widespread attention, demonstrating agents that solved a simulated car-racing task and a VizDoom task using a learned world model. In the VizDoom experiment, the policy was trained entirely inside hallucinated simulations generated by the agent's own world model.

World models are distinct from systems that only classify inputs or generate outputs. They model dynamics: causality, physics, object persistence, and temporal transitions. The question they answer is not "what is this?" but "what will happen next if I do X?"

02. Why It Matters

Planning without physical cost:
An agent with a good world model can explore many possible futures mentally before committing to one action, much as humans appear to do when they plan. A robot without a world model must try things physically to learn. A robot with one can reason ahead.

Data efficiency:
Real-world robotic data is expensive and slow to collect. A world model trained on existing data can generate synthetic experience for policy training. This is especially important for rare or dangerous situations, like an autonomous vehicle encountering a tornado or an unusual pedestrian configuration.

Grounding and physical understanding:
Yann LeCun's influential 2022 position paper "A Path Towards Autonomous Machine Intelligence" argues that true machine intelligence requires predictive models of the world, not just pattern matching over text. LeCun contends that a system trained only on language tokens has no way to predict physical events, making world models architecturally necessary for AGI.

Generalisation:
A world model that captures underlying physics or dynamics can generalise to novel situations that were not in training data, because it reasons about structure rather than surface statistics.

03. How It Works

The core architecture has three components. An encoder compresses raw sensory inputs (pixels, lidar, audio) into a compact latent representation. A predictor takes the current latent representation plus an action and outputs a predicted future latent representation. A controller or planner then uses those predictions to choose actions. During training, the model minimises prediction error in this latent space.
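The encoder, predictor, and latent-space loss described above can be sketched in a few lines. This is a minimal illustration, not any published architecture: the linear maps, the weight matrices (`W_ENC`, `W_PRED_GOOD`, `W_PRED_BAD`), and the function names are all assumptions chosen to keep the arithmetic easy to follow.

```python
# Minimal sketch: linear encoder + linear predictor, loss measured in latent space.
# All names, shapes, and weights are illustrative assumptions.

def encode(obs, w_enc):
    """Encoder: compress an observation vector into a smaller latent vector."""
    return [sum(w * x for w, x in zip(row, obs)) for row in w_enc]

def predict_next_latent(latent, action, w_pred):
    """Predictor: map (current latent, action) to the predicted next latent."""
    inputs = latent + [action]
    return [sum(w * x for w, x in zip(row, inputs)) for row in w_pred]

def latent_prediction_loss(obs, action, next_obs, w_enc, w_pred):
    """Mean squared error between the predicted and the encoded next latent."""
    z = encode(obs, w_enc)
    z_target = encode(next_obs, w_enc)
    z_pred = predict_next_latent(z, action, w_pred)
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target)) / len(z_target)

# 4-D observation -> 2-D latent: this encoder keeps the first two dimensions
# and discards the last two (standing in for irrelevant detail).
W_ENC = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
# Predictor columns are [z0, z1, action]. This one says "the action shifts z0".
W_PRED_GOOD = [[1.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]]
# This one ignores the action entirely.
W_PRED_BAD = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]

obs, action, next_obs = [1.0, 2.0, 9.0, 9.0], 0.5, [1.5, 2.0, 7.0, 7.0]
loss_good = latent_prediction_loss(obs, action, next_obs, W_ENC, W_PRED_GOOD)
loss_bad = latent_prediction_loss(obs, action, next_obs, W_ENC, W_PRED_BAD)
print(loss_good, loss_bad)  # 0.0 0.125
```

Note that the last two observation dimensions change between `obs` and `next_obs`, yet the loss ignores them because the encoder never passes them into the latent. In practice the encoder and predictor are neural networks trained jointly, and extra safeguards are typically needed so that the encoder does not collapse to a constant output that makes the loss trivially zero.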

Modern approaches often use the Joint Embedding Predictive Architecture (JEPA), proposed by LeCun. Rather than predicting every pixel of the next frame (which requires the model to predict irrelevant details like background noise), JEPA works in abstract embedding space. This is more efficient and focuses the model's capacity on semantically relevant dynamics.

Large-scale world models such as DeepMind's Genie series take a generative approach: given a text prompt or layout, Genie 3 (August 2025) produces photorealistic interactive worlds at 24 frames per second that maintain 3D consistency and physical plausibility. These are used as simulators for downstream agent training.

Nvidia's Cosmos 3 (June 2026) integrates physical reasoning, world simulation, and action generation into a single open-weight model family using a Mixture-of-Transformers approach. An autoregressive transformer handles reasoning. A diffusion transformer handles multimodal generation. This represents a convergence of world model and foundation model paradigms.

Model-based reinforcement learning (MBRL) is the broader field in which world models appear. The Dreamer series showed that agents can learn their policies largely from rollouts imagined inside a learned world model. DreamerV2 (Hafner et al., arXiv:2010.02193) reached human-level performance on the Atari benchmark this way, and Dreamer agents have also solved complex continuous control tasks with far less real environment interaction than comparable model-free approaches need.

04. Key Terms and Approaches

JEPA (Joint Embedding Predictive Architecture):
Predicts in abstract embedding space rather than pixel space, which avoids the cost and brittleness of reconstructing every input detail. Meta's V-JEPA 2 (June 2025) achieved state-of-the-art results on video understanding benchmarks and enabled zero-shot robot control in unfamiliar environments.

Latent space planning:
The agent simulates action sequences by rolling out predictions in latent space, then selects the action whose trajectory leads to the highest predicted reward. This is far cheaper than sampling real-world rollouts.
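A minimal sketch of latent space planning follows. It assumes a toy one-dimensional latent, a hypothetical `predict_next_latent` standing in for the learned predictor, and a hypothetical `predict_reward` standing in for a learned reward head. For clarity it scores every action sequence exhaustively; real planners usually sample candidates or refine them iteratively instead, because the number of sequences grows exponentially with the horizon.

```python
import itertools

def predict_next_latent(z, action):
    """Stand-in for a learned predictor (assumed toy dynamics: action shifts z)."""
    return z + action

def predict_reward(z, goal=5.0):
    """Stand-in for a learned reward head: closer to the goal is better."""
    return -abs(z - goal)

def plan(z0, horizon=3, actions=(-1.0, 0.0, 1.0)):
    """Score every action sequence by imagined rollout; return the best first action."""
    best_return, best_first = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        z, total = z0, 0.0
        for a in seq:
            z = predict_next_latent(z, a)  # imagined step, no real environment call
            total += predict_reward(z)
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first, best_return

first_action, imagined_return = plan(z0=2.0)
print(first_action, imagined_return)  # 1.0 -3.0
```

Only the first action of the best sequence is returned. A common design choice is to execute that one action, observe the real outcome, and replan from the new state, which limits how far the agent relies on long imagined rollouts.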

Dreamer:
A series of model-based RL agents that train their policies on imagined rollouts from a learned world model, while the world model itself is fitted to real experience. DreamerV2 (Hafner, Lillicrap et al.) and DreamerV3 demonstrated strong results across diverse environments, from Atari to continuous locomotion tasks.

Genie (DeepMind):
A generative world model trained on unlabeled internet videos. Genie 2 (late 2024) added 3D generation. Genie 3 (August 2025) produces real-time photorealistic interactive worlds from text prompts and has been adopted by Waymo for autonomous driving simulation.

Video as world model:
The observation that large video generation models implicitly learn physical and causal structure prompted research into using video models directly as world simulators. Sora (OpenAI) and similar systems demonstrated that coherent physics-like behavior can emerge from internet-scale video training.

Sim-to-real transfer:
The gap between simulation and the physical world is a key challenge. World models do not fully close this gap, but they help by providing richer and more controlled simulation environments than hand-engineered physics engines.

05. Examples

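CarRacing and VizDoom (Ha and Schmidhuber 2018):
The landmark demonstration. In CarRacing, a small controller learned to drive a virtual racing car using the compact features produced by the learned world model. In the VizDoom experiment, the agent was trained entirely inside its own self-generated dream, and the policy trained in imagination then transferred to the actual game environment.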

DreamerV3:
Using a single fixed set of hyperparameters, DreamerV3 achieves human-level or above performance on many Atari games and continuous control benchmarks without task-specific tuning.

Waymo World Model:
Using Genie 3 as a base, Waymo (February 2026) built a specialised world model for autonomous driving that generates synchronised camera and lidar outputs for rare edge cases. This allows the planner to train on situations (tornadoes, unusual pedestrian behavior) that real-world fleets almost never encounter.

V-JEPA 2 (Meta, June 2025):
Achieves state-of-the-art results on video understanding benchmarks and supports zero-shot robot control. It still struggles on IntPhys 2, a benchmark that tests detection of intuitive physics violations, which indicates that world models are improving but have not yet achieved robust physical common sense.

Nvidia Cosmos 3 (June 2026):
Open-weight model family combining physical reasoning, simulation, and action generation. The Nano variant (16B parameters) is designed to run on workstation-class hardware.

06. Open Challenges

Compounding error:
Prediction errors accumulate over long rollouts. A small inaccuracy in a single step compounds quickly, making long-horizon planning unreliable. This is an active area of research.
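The effect can be illustrated with a toy rollout. The dynamics below are assumptions chosen for the illustration: the "true" system multiplies the state by 1.10 each step, and the hypothetical learned model has estimated that factor as 1.12. How fast the gap grows in practice depends on the real dynamics and on the model.

```python
def true_step(x):
    """Assumed ground-truth dynamics for the illustration."""
    return 1.10 * x

def model_step(x):
    """Learned model with a small error in its one parameter (1.12 vs 1.10)."""
    return 1.12 * x

def rollout(x0, step, horizon):
    """Apply `step` repeatedly, feeding each output back in as the next input."""
    states = [x0]
    for _ in range(horizon):
        states.append(step(states[-1]))
    return states

def rollout_errors(x0=1.0, horizon=20):
    """Absolute gap between the model's rollout and the true trajectory per step."""
    true_states = rollout(x0, true_step, horizon)
    model_states = rollout(x0, model_step, horizon)
    return [abs(m - t) for m, t in zip(model_states, true_states)]

errors = rollout_errors()
for t in (1, 5, 10, 20):
    print(f"step {t:2d}: abs error = {errors[t]:.3f}")
```

The one-step error here is about 0.02, yet the gap widens at every step because each prediction is built on the previous, already slightly wrong, prediction rather than on a fresh observation.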

Partial observability:
Real environments are only partially visible. A world model must infer hidden state from limited observations, which is harder than in fully observed simulated environments.

Distribution shift:
World models trained on one environment may not generalise to a different one. The implicit physics learned may be too narrow.

Physical common sense:
V-JEPA 2 performing near chance on IntPhys 2 is instructive. State-of-the-art world models do not yet robustly understand basic physics violations. This is a known gap between learned dynamics and the structured causal reasoning humans use.

Integration with planning:
Combining world models with efficient search or planning algorithms (Monte Carlo Tree Search, for example) at scale remains an open engineering and research challenge.

07. Common Pitfalls and Misconceptions

World models are not the same as language models:
An LLM predicts the next text token. A world model predicts future states in a physical or simulated environment. LeCun argues these are architecturally distinct capabilities, and that text prediction alone is insufficient for physical grounding.

Simulation is not reality:
A world model is a learned approximation. Even a very good one captures regularities from training data and will fail on inputs far outside that distribution. Treating a world model as a ground-truth physics engine invites failure.

"Dream training" does not eliminate the need for real data:
World models must themselves be trained on real interaction data. They reduce the additional data needed for policy training, but cannot replace grounding data entirely.
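The dependency can be made concrete with a small sketch, under loudly stated assumptions: `real_env_step` is a stand-in for the real environment, the model form `x' = x + b * u` is assumed known, and only the parameter `b` is learned. Real interaction comes first and fits the model; only then can the model generate synthetic transitions for policy training.

```python
import random

def real_env_step(x, u):
    """Stand-in for the real environment (assumed toy dynamics)."""
    return x + 0.5 * u

def collect_real_data(n, rng):
    """Gather n real transitions (x, u, x_next) with random actions."""
    data, x = [], 0.0
    for _ in range(n):
        u = rng.choice([-1.0, 1.0])
        x_next = real_env_step(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data

def fit_model(data):
    """Least-squares fit of b in the assumed model form x' = x + b * u."""
    num = sum(u * (x_next - x) for x, u, x_next in data)
    den = sum(u * u for _, u, _ in data)
    return num / den

def imagine(b, n, rng):
    """Generate n synthetic transitions from the fitted model, no real steps used."""
    out = []
    for _ in range(n):
        x = rng.uniform(-5.0, 5.0)
        u = rng.choice([-1.0, 1.0])
        out.append((x, u, x + b * u))
    return out

rng = random.Random(0)
real_data = collect_real_data(20, rng)   # the grounding data that cannot be skipped
learned_b = fit_model(real_data)
synthetic = imagine(learned_b, 1000, rng)
print(learned_b, len(real_data), len(synthetic))
```

Twenty real transitions yield a thousand imagined ones here, but the imagined data is only as good as the fit. If the real data had not covered the relevant situations, the synthetic transitions would inherit that blind spot.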
