Latent prediction · World models · Planning · Self-supervised learning

World Modeling

A world model learns the dynamics of an environment: given part of the world — and optionally an action — it predicts what happens next, increasingly in representation space rather than raw observation space. This is a living, comprehensive encyclopedia of world models and the Joint-Embedding Predictive Architecture (JEPA) family: every method collected, explained, and cross-linked, and kept current as the field moves.

70 methods · 13 topics · biology track

The world-model landscape

[Figure: taxonomy of world models. Generative (predict observations): passive, observational (video prediction, video diffusion); action-conditioned, planning (Dreamer, Genie, GameNGen). Joint-embedding predictive (predict representations): representation pretraining (I-JEPA, V-JEPA, JEPA-DNA, Cell-JEPA); controllable, action-conditioned (V-JEPA 2-AC, ACT-JEPA, LeWorldModel, BioJEPA-AC ★).]
A map of world models. The primary split is what a model predicts: generative models reconstruct future observations (pixels, frames, tokens), while joint-embedding predictive models — the JEPA family — predict future representations. The second axis is agency: a passive model forecasts dynamics as they unfold, whereas an action-conditioned model predicts the consequence of an action, $\hat z_{t+1}=g_\phi(z_t,a_t)$, and can therefore plan. Two further axes cut across these: temporal structure (single-step vs hierarchical, multi-scale rollout) and function (perception · prediction · planning · simulation). This encyclopedia concentrates on the lower, latent-predictive branch — and especially its controllable corner, where a representation becomes a usable world model.

What is a world model?

A world model is a learned, predictive model of how an environment evolves. In its most general form it encodes an observation $o_t$ into a latent state $z_t$ and, optionally conditioned on an action $a_t$, predicts the next state, $\hat z_{t+1} = g_\phi(z_t, a_t)$. Roll that prediction forward and you have an imagination: a way to ask "what would happen if…?" without touching the real world.
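The encode, predict, roll-forward pattern can be sketched in a few lines. This is a minimal illustration, not any particular method: the hand-written `encode` and `predict` functions are hypothetical stand-ins for learned networks, and the point-mass dynamics, the time step `dt`, and the three-channel observation are all assumptions made for the example.

```python
from typing import List, Sequence

Vector = List[float]


def encode(observation: Sequence[float]) -> Vector:
    """Toy encoder: keep position and velocity, drop a nuisance channel."""
    position, velocity, _nuisance = observation
    return [position, velocity]


def predict(z: Vector, action: float, dt: float = 0.1) -> Vector:
    """Toy transition for a point mass; the action is an acceleration."""
    position, velocity = z
    return [position + dt * velocity, velocity + dt * action]


def rollout(z0: Vector, actions: Sequence[float]) -> List[Vector]:
    """Apply the predictor repeatedly without querying the environment."""
    trajectory = [z0]
    for action in actions:
        trajectory.append(predict(trajectory[-1], action))
    return trajectory


# The third observation channel is discarded by the encoder.
z0 = encode([0.0, 1.0, 0.37])
imagined = rollout(z0, actions=[0.0, 0.0, -1.0])
print(imagined[-1])  # imagined latent state after three steps
```

Changing the `actions` list and rerunning `rollout` is the "what would happen if" query in miniature: each call produces a different imagined trajectory from the same starting state.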

This idea is old and deep. It runs from optimal control and the Kalman filter, through model-based reinforcement learning (Dyna, PILCO, World Models, Dreamer), to today's large self-supervised predictors. What unifies them is a single loop:

perceive → predict → evaluate → act. A perception module maps observations to a compact state; a transition model predicts how that state changes; a cost or value function scores imagined futures; and a planner or policy chooses actions. Yann LeCun's "A Path Towards Autonomous Machine Intelligence" casts this as the blueprint for autonomous agents, with the world model as the central, learnable component.
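The loop can be made concrete with a toy planner. The sketch below uses random shooting (sample action sequences, score each imagined rollout, execute the first action of the best one), which is only one of many possible planners and is chosen here for brevity. The point-mass dynamics, the cost function, the goal, and every constant are assumptions for illustration, and the toy environment is taken to be identical to the model.

```python
import random
from typing import Sequence, Tuple

State = Tuple[float, float]  # (position, velocity)
GOAL = 1.0
DT = 0.1


def perceive(observation: Sequence[float]) -> State:
    """Map an observation to a compact state (identity in this toy)."""
    return (observation[0], observation[1])


def predict(state: State, action: float) -> State:
    """Transition model: a point mass driven by an acceleration."""
    position, velocity = state
    return (position + DT * velocity, velocity + DT * action)


def evaluate(state: State) -> float:
    """Cost of a state: squared distance to the goal plus a speed penalty."""
    position, velocity = state
    return (position - GOAL) ** 2 + 0.1 * velocity ** 2


def plan(state: State, rng: random.Random,
         horizon: int = 10, candidates: int = 128) -> float:
    """Random shooting: return the first action of the cheapest imagined rollout."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined, cost = state, 0.0
        for action in actions:
            imagined = predict(imagined, action)
            cost += evaluate(imagined)
        if cost < best_cost:
            best_cost, best_first = cost, actions[0]
    return best_first


rng = random.Random(0)
state = perceive([0.0, 0.0])
initial_cost = evaluate(state)
for _ in range(100):
    action = plan(state, rng)        # predict + evaluate, all in imagination
    state = predict(state, action)   # act: here the environment equals the model
final_cost = evaluate(state)
print(initial_cost, final_cost)
```

Replanning at every step, rather than committing to a whole sequence, is the usual model-predictive-control pattern: it lets the agent correct for whatever the previous plan got wrong.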

The reason world models matter is sample efficiency and foresight. An agent that can simulate consequences internally can plan over long horizons, generalize to unseen goals zero-shot, and avoid the cost (or danger) of trial-and-error in the real environment — whether that environment is a video game, a robot's workspace, a financial series, or a living cell.

Two ways to model a world

World models split along what they predict — and the difference is best seen as two diagrams.

1 · Generative — predict observations

Generative / reconstructive models predict future observations — pixels, tokens, raw counts. Video-prediction networks and latent-diffusion world models live here. They are visually interpretable, but they spend enormous capacity reconstructing detail that is irrelevant or fundamentally unpredictable (texture, sensor noise, dropout).

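[Figure: generative world model. Observation $o_t$ → encoder $f_\theta$ → latent $z_t$ → predictor $g_\phi$ → decoder $d_\psi$ → predicted pixels $\hat o_{t+1}$, compared with the target $o_{t+1}$ by the pixel-space loss $\lVert \hat o - o \rVert^2$.]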
A generative world model encodes the past, predicts forward, then decodes back to observations. It is trained to reconstruct the next frame, so the loss lives in pixel space. This is expressive and directly inspectable — you can look at what it imagines — but it must spend capacity on every pixel, including unpredictable texture and sensor noise.
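One way to see the cost of unpredictable detail is to measure the pixel loss of a perfect predictor. In the sketch below the next frame is assumed to be a predictable signal plus independent Gaussian sensor noise; the sine pattern, the noise level, and the frame size are arbitrary choices for illustration. Even an oracle that outputs the signal exactly is left with a loss close to the noise variance.

```python
import math
import random

rng = random.Random(0)
NOISE_STD = 0.5      # assumed sensor-noise level
N_PIXELS = 10_000    # assumed frame size (flattened)

# Predictable part of the next frame: a smooth pattern a model could learn.
signal = [math.sin(0.01 * i) for i in range(N_PIXELS)]
# Observed next frame: predictable signal plus unpredictable noise.
target = [s + rng.gauss(0.0, NOISE_STD) for s in signal]


def pixel_mse(prediction, observed):
    """Mean squared error in observation (pixel) space."""
    total = sum((p - o) ** 2 for p, o in zip(prediction, observed))
    return total / len(observed)


# An oracle that predicts the signal perfectly still pays for the noise.
oracle_loss = pixel_mse(signal, target)
print(oracle_loss, NOISE_STD ** 2)
```

No amount of model capacity lowers this floor, because the remaining error is noise that carries no information about the future.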

2 · Joint-embedding predictive (JEPA) — predict representations

JEPA models predict future representations instead. A context encoder embeds what is visible; a target encoder, an exponential-moving-average (EMA) copy of the context encoder, embeds what is hidden; and a lightweight predictor maps the context, plus mask or action tokens, to the target embedding. The objective is simply $\lVert \hat z - \mathrm{sg}(z) \rVert^2$, where $\mathrm{sg}$ denotes stop-gradient, paired with an anti-collapse mechanism (EMA targets, VICReg-style variance/covariance regularization, or LeJEPA's SIGReg).

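[Figure: joint-embedding predictive world model. Observation $o_t$ → context encoder $f_\theta$ → latent $z_t$ → predictor $g_\phi$ → $\hat z_{t+1}$. The future observation $o_{t+1}$ → target encoder $\bar f_\theta$ (EMA) → $z_{t+1}$, passed through a stop-gradient (sg). The latent loss is $\lVert \hat z - z \rVert^2$.]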
A joint-embedding predictive world model predicts the next state's representation, not its pixels. A target encoder (an EMA copy) embeds the future observation; the predictor must match that embedding, so the loss lives in representation space. With no decoder, the model is free to discard unpredictable detail and keep only what is needed to anticipate the future.

That freedom to drop unpredictable detail is why the family has spread so quickly from images and video to audio, 3D, graphs, time series, and biosignals.
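The two ingredients that are specific to the EMA variant, the latent loss against a fixed target and the slow target-encoder update, can be sketched as follows. This is a schematic in plain Python, not a training loop: the flattened weight lists, the momentum `tau`, and the example embeddings are hypothetical values chosen for illustration.

```python
from typing import List

Vector = List[float]


def latent_loss(z_pred: Vector, z_target: Vector) -> float:
    """Squared L2 distance in representation space.

    z_target plays the role of sg(z): it is treated as a constant, so in a
    real framework no gradient would flow back into the target encoder.
    """
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target))


def ema_update(target_params: Vector, context_params: Vector,
               tau: float = 0.99) -> Vector:
    """Move the target-encoder weights a small step toward the context encoder."""
    return [tau * t + (1.0 - tau) * c
            for t, c in zip(target_params, context_params)]


context_params = [1.0, -2.0, 0.5]  # hypothetical flattened encoder weights
target_params = [0.0, 0.0, 0.0]
for _ in range(3):
    target_params = ema_update(target_params, context_params, tau=0.9)

loss = latent_loss([0.2, 0.4], [0.0, 0.0])
print(target_params, loss)
```

In an autodiff framework the stop-gradient would be an explicit call on the target embedding; here it is only implied by the fact that `ema_update` is the sole way the target weights change.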

From representation to world model

Not every JEPA is a world model. Masked-prediction variants (I-JEPA, V-JEPA, and most domain adaptations) learn excellent representations but do not yet model dynamics. A representation becomes a world model when the predictor is made action-conditioned, $\hat z_{t+1} = g_\phi(z_t, a_t)$, so the latent can be rolled out under chosen actions. V-JEPA 2-AC, ACT-JEPA, LeWorldModel, and hierarchical-planning models make this step explicit, and recent theory (LeJEPA's world-model analysis) gives conditions under which the learned latent faithfully recovers the world's underlying variables and supports optimal planning. Those conditions are strong, so the field treats them as guidance to validate against, not a guarantee.

The biology thread

One question recurs throughout this collection: can the same latent-prediction recipe yield a world model of the cell? Gene expression, genomes, and molecules are noisy, sparse, and partially observed. Reconstructing raw counts wastes capacity on technical artifacts; the right target is the latent biological state. An action-conditioned JEPA whose actions are interventions (a gene knockout, a compound at a dose, a combination) becomes a counterfactual engine for drug discovery: predict how a cell state moves under a perturbation, then search for the intervention that drives it toward a desired state. Entries marked ★ are the building blocks of that program.
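The predict-then-search step can be illustrated with a toy lookup. Everything in the sketch is hypothetical: the intervention names, the two-dimensional latent, the additive effect vectors standing in for a learned action-conditioned predictor, and the goal state are invented for the example and carry no biological meaning.

```python
import math
from typing import Dict, List

Vector = List[float]

# Hypothetical latent shifts, standing in for a learned predictor's output.
EFFECTS: Dict[str, Vector] = {
    "knockout_A": [0.8, 0.0],
    "compound_B_low_dose": [0.2, 0.3],
    "compound_B_high_dose": [0.4, 0.9],
}


def predict(z: Vector, intervention: str) -> Vector:
    """Toy action-conditioned predictor: add the intervention's latent shift."""
    return [zi + di for zi, di in zip(z, EFFECTS[intervention])]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two latent states."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_intervention(z: Vector, z_goal: Vector) -> str:
    """Exhaustive search: pick the intervention whose predicted state is closest."""
    return min(EFFECTS, key=lambda name: distance(predict(z, name), z_goal))


z_now = [0.0, 0.0]    # hypothetical current cell state in latent space
z_goal = [0.5, 1.0]   # hypothetical desired state
choice = best_intervention(z_now, z_goal)
print(choice)
```

With a real predictor the candidate set would be far too large to enumerate, so the exhaustive `min` would give way to a guided search, but the structure stays the same: imagine each intervention, score the outcome, keep the best.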
