At a glance
Problem: Turn large-scale video representation learning into a usable world model that can plan actions.
Key idea: Pretrain a video JEPA at internet scale, then post-train an action-conditioned predictor (V-JEPA 2-AC) that rolls the latent forward under actions.
Modality: Video (+ robot actions for the AC variant)
Target / masking: Spatiotemporal multi-block (tubelet) masking; EMA target encoder.
Builds on: V-JEPA (feature prediction from video) and I-JEPA.
Used for: Motion understanding, action anticipation, and zero-shot robot planning.

Motivation

A representation that merely classifies video is not yet a world model. To act, an agent needs a model of consequences: given the current state and an action, what state follows? V-JEPA 2 scales video feature-prediction to over a million hours of video to learn general visual dynamics, then asks whether a small amount of interaction data is enough to make that representation controllable — predictive of the effect of actions — so it can be used for planning without task-specific rewards or demonstrations.

How it works

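[Figure: obs $o_t$ → encoder $f_\theta$ → $z_t$; the predictor $g_\phi$ is applied repeatedly with actions $a_t, a_{t+1}, a_{t+2}$ to give $\hat z_{t+1}, \hat z_{t+2}, \dots$; the objective is to minimise $\lVert \hat z - z^* \rVert$ against the goal $z^*$.]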
V-JEPA 2-AC rolls the latent forward under chosen actions: $\hat z_{t+1}=g_\phi(z_t,a_t)$. Planning picks the action sequence whose imagined latent trajectory reaches a goal embedding $z^*$ at lowest cost — zero-shot, without acting in the world.

Training has two stages:

  1. Self-supervised pretraining. A ViT (up to ~1B parameters) is trained with the V-JEPA objective — predict masked spatiotemporal feature blocks against an EMA target — on a very large, unlabelled video corpus. This yields an encoder $f_\theta$ that maps an observation $o_t$ to a latent $z_t$ capturing motion and dynamics.
  2. Action-conditioned post-training (V-JEPA 2-AC). The encoder is frozen and a predictor is trained on a comparatively small robot-interaction dataset to model $\hat z_{t+1}=g_\phi(z_t,a_t)$.
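The two stage-1 ingredients named above, tubelet masking and an EMA target, can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `multiblock_tube_mask` and `ema_update`, the token-grid size, the block sizes, and the momentum value are all assumptions chosen for the example, not values taken from V-JEPA 2.

```python
import numpy as np

def multiblock_tube_mask(t, h, w, n_blocks, block_h, block_w, rng):
    """Return a boolean mask of shape (t, h, w); True marks masked tokens.

    Each sampled spatial block is repeated over every time step, so the
    masked region forms a tube through the clip. Blocks may overlap.
    """
    mask = np.zeros((t, h, w), dtype=bool)
    for _ in range(n_blocks):
        top = rng.integers(0, h - block_h + 1)    # upper bound is exclusive
        left = rng.integers(0, w - block_w + 1)
        mask[:, top:top + block_h, left:left + block_w] = True
    return mask

def ema_update(target, online, momentum=0.99):
    """Move target weights toward online weights: t <- m * t + (1 - m) * o."""
    return {k: momentum * target[k] + (1.0 - momentum) * online[k]
            for k in target}

# Usage on toy shapes (hypothetical 8 x 14 x 14 token grid).
rng = np.random.default_rng(0)
mask = multiblock_tube_mask(t=8, h=14, w=14, n_blocks=2,
                            block_h=6, block_w=6, rng=rng)

# Toy "weights": a single parameter vector per network.
online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
target = ema_update(target, online, momentum=0.9)
```

In stage 2 the encoder is frozen, so neither function is needed there; only the predictor's parameters are updated.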

At test time the agent plans: it samples candidate action sequences, rolls the latent forward through $g_\phi$, and selects the sequence whose imagined trajectory reaches a goal embedding $z^*$ at lowest cost (model-predictive control). No reward model or task demonstration is required.
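The sample, roll out, and select loop described above can be sketched as follows. Random shooting is used here as the simplest sample-and-select optimiser; it is not claimed to be the optimiser the authors use. The `toy_predictor` is a hypothetical stand-in for $g_\phi$, and the action bounds, horizon, and candidate count are assumptions for illustration.

```python
import numpy as np

def rollout(predictor, z0, actions):
    """Roll a latent forward through the predictor, one action per step."""
    z = z0
    for a in actions:
        z = predictor(z, a)
    return z

def plan_random_shooting(predictor, z0, z_goal, horizon, action_dim,
                         n_candidates=256, rng=None):
    """Return the sampled action sequence whose imagined final latent
    is closest to the goal embedding, together with its cost."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences: (n_candidates, horizon, action_dim).
    candidates = rng.uniform(-1.0, 1.0,
                             size=(n_candidates, horizon, action_dim))
    costs = np.array([
        np.linalg.norm(rollout(predictor, z0, seq) - z_goal)
        for seq in candidates
    ])
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

def toy_predictor(z, a):
    """Hypothetical stand-in for g_phi: the action nudges the latent."""
    return z + 0.5 * a

z0 = np.zeros(2)                    # current latent z_t
z_goal = np.array([1.0, -0.5])      # goal embedding z*
actions, cost = plan_random_shooting(
    toy_predictor, z0, z_goal, horizon=3, action_dim=2,
    rng=np.random.default_rng(0),
)
```

In a model-predictive control loop, only the first action of the selected sequence would typically be executed before re-encoding the new observation and planning again.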

The objective

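Pretraining uses the latent feature-prediction loss over masked tubelets, $\lVert g_\phi(z_{\text{ctx}},m)-\operatorname{sg}[\bar f_\theta(x)]\rVert^2$, where $z_{\text{ctx}}$ is the encoding of the visible context, $m$ marks the masked positions to predict, $\bar f_\theta$ is the EMA target encoder, and $\operatorname{sg}$ denotes stop-gradient. The action-conditioned predictor is trained to forecast the next latent given the current latent and action,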

$$\mathcal{L}_{\text{AC}} = \big\lVert\, g_\phi(z_t, a_t) - \operatorname{sg}[f_\theta(o_{t+1})] \,\big\rVert^2,$$

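and planning solves $\min_{a_{t:t+H-1}} \; \lVert \hat z_{t+H} - z^* \rVert$ over a short horizon $H$, where $\hat z_{t+H}$ is obtained by applying $g_\phi$ recursively to the $H$ actions $a_t,\dots,a_{t+H-1}$.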
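To make $\mathcal{L}_{\text{AC}}$ concrete, the sketch below fits a linear predictor to next-step latents by gradient descent. Everything here is a toy assumption: the latents are random vectors standing in for frozen-encoder outputs, the dynamics (`A_true`, `B_true`) are invented, and the linear form of the predictor is chosen only so the gradient can be written by hand. The stop-gradient appears as the fact that the targets `z_next` are constants and only the predictor weights `W` are updated.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, action_dim, n = 4, 2, 512

# Toy data standing in for frozen-encoder outputs f_theta(o_t), f_theta(o_{t+1}).
A_true = 0.9 * np.eye(latent_dim)                   # invented latent dynamics
B_true = rng.normal(size=(latent_dim, action_dim))  # invented action effect
z_t = rng.normal(size=(n, latent_dim))
a_t = rng.normal(size=(n, action_dim))
z_next = z_t @ A_true.T + a_t @ B_true.T            # targets: treated as constants

# Linear predictor g_phi(z, a) = [z, a] @ W.
x = np.concatenate([z_t, a_t], axis=1)
W = np.zeros((latent_dim + action_dim, latent_dim))

def ac_loss(W):
    """Mean over the batch of || g_phi(z_t, a_t) - z_{t+1} ||^2."""
    residual = x @ W - z_next
    return float(np.mean(np.sum(residual ** 2, axis=1)))

initial_loss = ac_loss(W)
lr = 0.05
for _ in range(200):
    residual = x @ W - z_next
    grad = 2.0 * x.T @ residual / n   # gradient w.r.t. W only; encoder is frozen
    W -= lr * grad
final_loss = ac_loss(W)
```

On this toy problem the fitted `W` recovers the invented dynamics, so rolling it forward for $H$ steps would reproduce the true latent trajectory; with a learned nonlinear predictor, errors would instead compound over the horizon.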

Key results & what's novel

V-JEPA 2 reaches state-of-the-art motion understanding (e.g. Something-Something v2) and human action anticipation (e.g. Epic-Kitchens), and — most importantly — its action-conditioned variant performs zero-shot robot manipulation (reach, grasp, pick-and-place) on a real arm by planning in latent space, using a frozen pretrained encoder and only modest interaction data. The contribution is the demonstration that a self-supervised video JEPA can be cheaply converted into a planning-capable world model.

Strengths & limitations

  • + Internet-scale pretraining transfers to control with a frozen encoder.
  • + Zero-shot planning without task rewards or demonstrations.
  • − Planning horizons are short; long-horizon control needs hierarchy (cf. hierarchical latent world models).
  • − Goals must be specified as latent embeddings; the predictor is deterministic, so it does not represent uncertainty over futures.
  • − Inference-time planning (sampling + rollouts) adds compute at deployment.

Connections & references

Builds on: V-JEPA, I-JEPA