Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework proposed by Yann LeCun in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." JEPA represents a fundamental departure from both generative models (which predict pixels/tokens) and contrastive learning (which requires negative pairs). Instead, JEPA makes predictions in abstract representation space — it learns to predict the latent embedding of a target signal from the embedding of a context signal, without ever reconstructing the raw input.
The core insight is elegant: by predicting what you should know about the target rather than every pixel detail, the model naturally learns to discard unpredictable noise (textures, lighting, exact pixel values) and retain semantic, structural information. This is precisely what biological perception does — you recognize a chair by its abstract structure, not by memorizing every photon reflected from its surface.
Every JEPA variant shares three core components, regardless of the input modality:

- **Context encoder** fθ — processes the visible (unmasked) portion of the input; trained by gradient descent.
- **Target encoder** f̄θ — an exponential moving average of the context encoder, updated as θ̄ ← τθ̄ + (1-τ)θ. Processes the full, unmasked input. No gradient flows through it.
- **Predictor** gφ — maps the context embeddings, together with positional information about the masked regions, to predicted target embeddings.

| Aspect | Generative (MAE) | Contrastive (SimCLR, DINO) | JEPA (Predictive) |
|---|---|---|---|
| Prediction space | Input (pixels/tokens) | None (alignment) | Latent representations |
| Negative samples | No | Yes (or momentum) | No |
| Augmentations | Masking | Heavy (crop, color, etc.) | Masking only |
| Collapse avoidance | N/A (reconstructive) | Negatives / momentum | Predictor bottleneck + EMA |
| Capacity waste | High (pixel details) | Low | Low (semantic only) |
| Low-shot transfer | Moderate | Good | Excellent |
| Domain flexibility | Good | Needs augmentation design | Excellent (masking is universal) |
The training objective is an L2 distance computed entirely in latent space:

L = (1/|M|) Σᵢ∈M ‖ gφ(fθ(x_ctx))ᵢ − sg(f̄θ(x))ᵢ ‖²

where M is the set of masked (target) positions, gφ is the predictor, fθ is the context encoder, f̄θ is the EMA target encoder, and sg(·) is the stop-gradient operator.
Multi-block masking: sample M = 4 target blocks, each covering 15–20% of the image area, with aspect ratio between 0.75 and 1.5. The context is the set of all remaining visible patches. The predictor receives the context embeddings plus learnable mask tokens carrying positional encodings for the target positions.
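The sampling procedure above can be sketched in plain Python. This is a minimal illustration on a 14×14 patch grid (a 224-px image with 16-px patches); the official I-JEPA implementation additionally samples a large context block and removes its overlap with the targets, which is omitted here:

```python
import random

def sample_target_blocks(grid=14, num_blocks=4,
                         scale=(0.15, 0.20), aspect=(0.75, 1.5), seed=0):
    """Sample rectangular target blocks on a grid x grid patch grid.

    Each block's area is a fraction of the image area drawn from `scale`,
    with aspect ratio (h/w) drawn from `aspect`. Returns the set of
    (row, col) patch indices covered by any target block, plus the
    context set (its complement).
    """
    rng = random.Random(seed)
    targets = set()
    for _ in range(num_blocks):
        area = rng.uniform(*scale) * grid * grid
        ratio = rng.uniform(*aspect)                      # ratio = h / w
        h = max(1, min(grid, round((area * ratio) ** 0.5)))
        w = max(1, min(grid, round((area / ratio) ** 0.5)))
        top = rng.randrange(grid - h + 1)
        left = rng.randrange(grid - w + 1)
        targets |= {(r, c) for r in range(top, top + h)
                           for c in range(left, left + w)}
    context = {(r, c) for r in range(grid) for c in range(grid)} - targets
    return targets, context

targets, context = sample_target_blocks()
print(len(targets) + len(context))  # 196: targets and context partition the grid
```

Because the blocks may overlap each other, the union of targets can cover less than 4 × 15–20% of the patches; the context is always exactly the complement.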
| Variant | Modality | Masking | Predictor | Loss | Key Innovation | Code |
|---|---|---|---|---|---|---|
| JEPA | Theory | -- | -- | -- | Latent-space prediction concept | -- |
| I-JEPA | Image | Multi-block spatial | Narrow transformer | L2 | First implementation; multi-block masking | GitHub |
| H-JEPA | Image | Multi-block | 4-layer transformer | VICReg+pred | FPN hierarchy; multi-scale | GitHub |
| MC-JEPA | Video | Factored (time+space) | 2 predictors | L2 dual | Disentangled motion vs content | -- |
| V-JEPA | Video | Spatiotemporal tubes | Transformer | L2 | No pixel reconstruction for video | GitHub |
| Point-JEPA | 3D Points | Proximity 3D blocks | Transformer | L2/SmoothL1 | Sequencer for 3D ordering | -- |
| 3D-JEPA | 3D Points | Multi-block 3D | Context-aware | L2 | Context-aware decoder | -- |
| ACT-JEPA | Robot | -- | Joint | Action+Latent | Joint action + observation | -- |
| V-JEPA 2 | Video+Robot | Multi-scale temporal | 12-block deep | L2+VICReg | 1M hours; zero-shot robot | -- |
| Audio-JEPA | Audio | Patch on mel-spec | Transformer | L2 | ViT on mel-spectrograms | -- |
| LeJEPA | Image | Adaptive curriculum | Wide, shallow | L2+VICReg | Fix I-JEPA fragilities | -- |
| Causal-JEPA | Video/Sim | Object-level | Object | Latent | Object masking = interventions | Yes |
| V-JEPA 2.1 | Video+Robot | Dense (vis+masked) | Deep | Dense pred | Dense features; deep self-sup | -- |
| ThinkJEPA | Video+VLM | Dual-temporal | JEPA+VLM | Hybrid | VLM reasoning + JEPA | -- |
| LeWorldModel | Pixels | Temporal | Light | Pred+Gaussian | 2 losses, 1 HP, no EMA | -- |
LeCun proposed JEPA as the world model at the center of a modular cognitive architecture for autonomous machine intelligence, with five modules: perception (extracts representations), a JEPA world model (predicts future states in latent space), cost (evaluates desirability), actor (proposes actions), and short-term memory.
Predicting all details of the future (pixel-level) is both intractable and unnecessary. A self-driving car doesn't need to predict every leaf — it needs to predict "the car ahead will brake." JEPA achieves this by encoding observations into compact representations and predicting future representations conditioned on actions.
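To make the module decomposition concrete, here is a toy sketch of how an actor could score candidate actions by rolling the world model forward in latent space and evaluating the cost module. All functions are invented stand-ins for illustration; none of them come from the paper:

```python
def encode(obs):
    """Perception: compress a raw observation (here, a list of floats)
    into a low-dimensional latent -- a toy stand-in for an encoder."""
    return (sum(obs) / len(obs), max(obs) - min(obs))   # toy 2-d latent

def predict(latent, action):
    """World model: predict the next latent conditioned on an action.
    A real JEPA predictor is a transformer; this is a toy linear update."""
    mean, spread = latent
    return (mean + 0.1 * action, spread * 0.95)

def cost(latent, goal_mean=1.0):
    """Cost module: scalar desirability of a predicted latent state."""
    return (latent[0] - goal_mean) ** 2

# Actor: pick the action whose predicted latent minimizes the cost.
z = encode([0.2, 0.4, 0.1, 0.3])
best = min((-1.0, 0.0, 1.0), key=lambda a: cost(predict(z, a)))
print("best action:", best)   # the action that moves the latent toward the goal
```

The point of the sketch is the control flow: planning happens entirely over latents, and the raw observation is only touched once, by the perception module.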
The first concrete implementation of JEPA. Demonstrates that predicting in latent space produces representations superior to MAE for downstream tasks, especially in low-shot and transfer settings.
MAE reconstructs pixels, forcing the encoder to retain low-level texture/color. I-JEPA predicts abstract representations, so the encoder focuses on high-level structure. This is why I-JEPA excels at object counting and depth estimation (structure, not texture).
```python
# I-JEPA training pseudocode
x = sample_image()
ctx_patches, tgt_positions = multi_block_mask(x)

z_ctx = context_encoder(ctx_patches)       # ViT on visible patches only
with no_grad():
    z_tgt = target_encoder(x)              # full image through EMA encoder

z_pred = predictor(z_ctx, mask_tokens)     # predict at target positions
loss = mse_loss(z_pred, z_tgt[tgt_positions])
loss.backward(); optimizer.step()

# EMA update of the target encoder (tau close to 1, annealed toward 1.0)
target_encoder.params = tau * target_encoder.params + (1 - tau) * context_encoder.params
```
Extends I-JEPA with multi-scale hierarchical representation learning via a Feature Pyramid Network (FPN). Learns representations at 3 hierarchy levels simultaneously.
```
# H-JEPA repository structure
src/
  models/   # encoder, predictor, H-JEPA module
  losses/   # VICReg, SigReg, combined
  masks/    # masking strategies
  data/     # datasets and transforms
```

```yaml
# Config
model: { encoder: vit_tiny, embed_dim: 192, num_hierarchies: 3 }
loss: { type: combined }   # vicreg + prediction
```
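Since H-JEPA's combined loss leans on VICReg, here is a dependency-free sketch of the three VICReg terms (invariance, variance, covariance) with the weights from the original VICReg paper. A real implementation operates on framework tensors; this version takes batches as lists of equal-length lists:

```python
def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg over two batches of paired embeddings.

    invariance: mean squared distance between paired embeddings;
    variance:   hinge pushing each dimension's std above 1 (anti-collapse);
    covariance: squared off-diagonal covariance, decorrelating dimensions.
    """
    n, d = len(za), len(za[0])
    # invariance term
    sim = sum((a - b) ** 2
              for va, vb in zip(za, zb) for a, b in zip(va, vb)) / (n * d)

    def var_cov(z):
        means = [sum(v[j] for v in z) / n for j in range(d)]
        centered = [[v[j] - means[j] for j in range(d)] for v in z]
        stds = [(sum(c[j] ** 2 for c in centered) / (n - 1) + eps) ** 0.5
                for j in range(d)]
        var = sum(max(0.0, 1.0 - s) for s in stds) / d
        cov = 0.0
        for j in range(d):
            for k in range(d):
                if j != k:
                    cjk = sum(c[j] * c[k] for c in centered) / (n - 1)
                    cov += cjk ** 2
        return var, cov / d

    va, ca = var_cov(za)
    vb, cb = var_cov(zb)
    return sim_w * sim + var_w * (va + vb) + cov_w * (ca + cb)

z = [[1.5, 0.0], [-1.5, 0.0], [0.0, 1.5], [0.0, -1.5]]
print(vicreg_loss(z, z))  # 0.0: aligned, high-variance, decorrelated
```

The variance hinge is what lets H-JEPA (and LeJEPA-style variants) avoid collapse without negatives: a collapsed batch has zero per-dimension std and is penalized heavily.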
Extends JEPA to video with disentangled representations for motion and content using factored masking and two separate predictors.
Each predictor is a ~6-block transformer (384-dim). Both losses backpropagate through the shared context encoder.
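A toy sketch of the shared-encoder/two-head arrangement follows. Function names and the unweighted sum are illustrative assumptions, not taken from the paper; the key point is that both losses are summed, so gradients from both objectives flow into one shared context encoder:

```python
def mse(pred, tgt):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)

def mc_jepa_loss(z_ctx, motion_head, content_head,
                 z_motion_tgt, z_content_tgt):
    """Two predictors read the same shared context embedding; summing
    their losses lets both objectives shape one encoder (hypothetical
    sketch -- names and weighting are illustrative)."""
    return (mse(motion_head(z_ctx), z_motion_tgt) +
            mse(content_head(z_ctx), z_content_tgt))

# Toy heads: identity for content, a fixed shift for "motion".
z = [0.5, -0.5]
loss = mc_jepa_loss(z,
                    motion_head=lambda v: [x + 1.0 for x in v],
                    content_head=lambda v: list(v),
                    z_motion_tgt=[1.5, 0.5],
                    z_content_tgt=[0.5, -0.5])
print(loss)  # 0.0 -- both heads hit their targets exactly
```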
JEPA for self-supervised video understanding. Predicts latent representations of spatiotemporal masked regions without pixel-level reconstruction.
```python
# V-JEPA spatiotemporal masking
video = load_video(T=16, H=224, W=224)
tokens = patchify_3d(video, patch_size=(2, 16, 16))   # tubelet embedding

target_tubes = sample_tube_masks(
    num_targets=4, spatial_scale=(0.15, 0.2),
    temporal_span=(0.5, 1.0), aspect_ratio=(0.75, 1.5),
)

z_ctx = context_encoder(tokens[~target_tubes])   # visible tokens only
z_pred = predictor(z_ctx, mask_tokens_3d)
z_tgt = target_encoder(tokens)                   # EMA encoder, full clip
loss = mse(z_pred, stop_grad(z_tgt[target_tubes]))
```
Adapts JEPA to audio by treating mel-spectrograms as 2D images. Uses ViT backbone with random patch masking on spectrograms.
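Treating a mel-spectrogram as an image reduces to standard ViT patchification. A minimal sketch follows; the 16×16 patch size and the 128-mel × 512-frame spectrogram shape are illustrative assumptions, not the paper's exact settings:

```python
def patchify(spec, ph=16, pw=16):
    """Split a mel-spectrogram (2-d list: mel_bins x frames) into
    non-overlapping ph x pw patches, row-major, each flattened into a
    vector -- the same treatment a ViT applies to an image."""
    H, W = len(spec), len(spec[0])
    patches = []
    for r in range(0, H - ph + 1, ph):
        for c in range(0, W - pw + 1, pw):
            patches.append([spec[r + i][c + j]
                            for i in range(ph) for j in range(pw)])
    return patches

# 128 mel bins x 512 frames -> (128/16) * (512/16) = 256 patches
spec = [[0.0] * 512 for _ in range(128)]
patches = patchify(spec)
print(len(patches), len(patches[0]))  # 256 256
```

From here the recipe is the generic one: mask a random subset of these patches, encode the visible rest, and predict the masked patches' latent embeddings.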
JEPA for 3D point clouds. Introduces a sequencer module that orders patch embeddings by proximity for efficient context/target selection in 3D space.
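A greedy nearest-neighbor pass illustrates the sequencer idea: unlike image patches, 3D patches have no natural raster order, so one is imposed by spatial proximity. This is a hypothetical sketch of the concept; Point-JEPA's actual sequencer may differ in detail:

```python
def sequence_by_proximity(centers):
    """Order 3-d patch centers greedily by nearest neighbor so that
    adjacent indices in the sequence are spatially close, making
    contiguous index ranges usable as context/target blocks."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    remaining = list(range(len(centers)))
    order = [remaining.pop(0)]              # start from the first patch
    while remaining:
        last = centers[order[-1]]
        nxt = min(remaining, key=lambda i: d2(centers[i], last))
        remaining.remove(nxt)
        order.append(nxt)
    return order

centers = [(0, 0, 0), (5, 5, 5), (0, 1, 0), (5, 6, 5)]
print(sequence_by_proximity(centers))  # [0, 2, 1, 3]
```

With such an ordering, sampling a contiguous run of sequence indices yields a spatially coherent 3D block, mirroring what rectangular multi-block masking gives for free on images.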
A distinct approach to 3D JEPA with emphasis on context-aware decoding and geometry-aware masking.
Bridges imitation learning and self-supervised learning by jointly predicting action sequences and latent observation sequences end-to-end.
The major scale-up of JEPA: trained on >1 million hours of internet video. First JEPA model to demonstrate zero-shot robotic manipulation.
| Aspect | V-JEPA 1 | V-JEPA 2 |
|---|---|---|
| Scale | ViT-H (632M) | ViT-g (~1B+) |
| Data | ~2M clips | >1M hours video+images |
| Stages | Video only | Image → Video |
| Masking | Short-range | Multi-scale (short+long) |
| Predictor | ~6 blocks | 12 blocks, wider |
| Robot | None | Zero-shot Franka, <62h data |
Systematically diagnoses and fixes training fragilities in I-JEPA. Key finding: I-JEPA is more fragile than it appears.
Introduces object-level masking that acts as latent interventions with counterfactual-like properties. Moving from patches to semantically meaningful objects.
Unlocks dense features in JEPA: spatially structured, semantically coherent, temporally consistent.
| Task | V-JEPA 2 | V-JEPA 2.1 |
|---|---|---|
| Grasping | baseline | +20 points |
| Ego4D anticipation (mAP) | -- | 7.71 |
| Epic-Kitchens (R@5) | 39.7 | 40.8 |
| SSv2 | 77.3 | 77.7 |
| NYUv2 depth (RMSE, lower is better) | -- | 0.307 |
Combines JEPA with Vision-Language Model reasoning through a dual-temporal pathway.
A minimalist JEPA world model: two loss terms, one hyperparameter, no EMA, trainable on a single GPU in hours.
[1] LeCun (2022). A Path Towards Autonomous Machine Intelligence.
[2] Assran et al. (2023). I-JEPA. arXiv:2301.08243
[3] Bardes, Ponce, LeCun (2023). MC-JEPA. arXiv:2307.12698
[4] Bardes et al. (2024). V-JEPA.
[5] Saito et al. (2024). Point-JEPA. arXiv:2404.16432
[6] Hu et al. (2024). 3D-JEPA. arXiv:2409.15803
[7] Wiggins (2024). H-JEPA.
[8] Vujinovic, Kovacevic (2025). ACT-JEPA. arXiv:2501.14622
[9] Assran et al. (2025). V-JEPA 2. arXiv:2506.09985
[10] Tuncay et al. (2025). Audio-JEPA. arXiv:2507.02915
[11] LeJEPA (2025). arXiv:2511.08544
[12] Nam et al. (2026). Causal-JEPA. arXiv:2602.11389
[13] Mur-Labadia et al. (2026). V-JEPA 2.1. arXiv:2603.14482
[14] Zhang et al. (2026). ThinkJEPA. arXiv:2603.22281
[15] Maes et al. (2026). LeWorldModel. arXiv:2603.19312