Motivation
A representation that merely classifies video is not yet a world model. To act, an agent needs a model of consequences: given the current state and an action, what state follows? V-JEPA 2 scales video feature-prediction to over a million hours of video to learn general visual dynamics, then asks whether a small amount of interaction data is enough to make that representation controllable — predictive of the effect of actions — so it can be used for planning without task-specific rewards or demonstrations.
How it works
Training has two stages:
- Self-supervised pretraining. A ViT (up to ~1B parameters) is trained with the V-JEPA objective on a very large, unlabelled video corpus: a predictor must reconstruct the features of masked spatiotemporal blocks, with the targets produced by an EMA copy of the encoder. This yields an encoder $f_\theta$ that maps an observation $o_t$ to a latent $z_t$ capturing motion and dynamics.
- Action-conditioned post-training (V-JEPA 2-AC). The encoder is frozen and a predictor is trained on a comparatively small robot-interaction dataset to model $\hat z_{t+1}=g_\phi(z_t,a_t)$.
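The second stage can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the frozen encoder is replaced by a fixed random projection, the predictor is linear instead of a transformer, the data are synthetic, and the names `encode`, `predict`, and `train_step` are hypothetical. It shows only the structure of the step: encoder weights stay fixed, the next-step latent is a constant target, and only the predictor is updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for observation, latent, and action vectors.
OBS_DIM, LATENT_DIM, ACTION_DIM = 16, 8, 3

# Stand-in for the frozen pretrained encoder f_theta: a fixed random projection.
W_ENC = rng.normal(size=(OBS_DIM, LATENT_DIM)) / np.sqrt(OBS_DIM)

def encode(obs):
    """Frozen encoder: W_ENC is never updated during post-training."""
    return np.tanh(obs @ W_ENC)

def predict(params, z, a):
    """Toy linear predictor g_phi(z_t, a_t) -> predicted z_{t+1}."""
    return z @ params["Wz"] + a @ params["Wa"] + params["b"]

def train_step(params, obs_t, act_t, obs_next, lr=0.05):
    """One gradient step on the mean squared latent-prediction error."""
    z_t = encode(obs_t)        # input latent
    target = encode(obs_next)  # target latent, treated as a constant (stop-gradient)
    err = predict(params, z_t, act_t) - target
    loss = float(np.mean(np.sum(err ** 2, axis=1)))
    grad_out = 2.0 * err / err.shape[0]  # d(loss) / d(prediction)
    params["Wz"] -= lr * z_t.T @ grad_out
    params["Wa"] -= lr * act_t.T @ grad_out
    params["b"] -= lr * grad_out.sum(axis=0)
    return loss

# Synthetic "interaction data": toy dynamics in which actions shift the observation.
B = rng.normal(size=(ACTION_DIM, OBS_DIM))
obs_t = rng.normal(size=(256, OBS_DIM))
act_t = rng.normal(size=(256, ACTION_DIM))
obs_next = obs_t + 0.1 * act_t @ B

params = {
    "Wz": np.zeros((LATENT_DIM, LATENT_DIM)),
    "Wa": np.zeros((ACTION_DIM, LATENT_DIM)),
    "b": np.zeros(LATENT_DIM),
}
losses = [train_step(params, obs_t, act_t, obs_next) for _ in range(200)]
```

Because the encoder is fixed, the latents could in principle be computed once and cached, which is part of why this stage is cheap compared with pretraining.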
At test time the agent plans with model-predictive control: it samples candidate action sequences, rolls the latent forward through $g_\phi$, and selects the sequence whose imagined final latent lies closest to a goal embedding $z^*$. No reward model or task demonstration is required.
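A minimal sketch of such a planner is given below, using a cross-entropy-method style loop as one possible sampling-based optimizer. The function names (`rollout`, `plan_cem`, `toy_predict`), the hyperparameters, the Euclidean cost, and the toy dynamics are all assumptions for illustration. In a real system `predict` would wrap the trained $g_\phi$.

```python
import numpy as np

def rollout(predict, z0, actions):
    """Apply the one-step predictor along an action sequence of shape (H, action_dim)."""
    z = z0
    for a in actions:
        z = predict(z, a)
    return z

def plan_cem(predict, z0, z_goal, horizon=5, action_dim=2,
             n_samples=256, n_elite=32, n_iters=5, seed=0):
    """Refit a diagonal Gaussian over action sequences to the lowest-cost samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current mean.
        cands = mean + std * rng.normal(size=(n_samples, horizon, action_dim))
        # Cost: distance between the imagined final latent and the goal latent.
        costs = np.array(
            [np.linalg.norm(rollout(predict, z0, c) - z_goal) for c in cands]
        )
        elite = cands[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy stand-in for g_phi: the latent is a 2-D position, the action a displacement.
def toy_predict(z, a):
    return z + 0.5 * a

z_start = np.array([0.0, 0.0])
z_goal = np.array([1.0, -1.0])
plan = plan_cem(toy_predict, z_start, z_goal)
final = rollout(toy_predict, z_start, plan)
```

In a receding-horizon loop only `plan[0]` would be executed; the agent would then re-encode the new observation and plan again.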
The objective
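Pretraining uses the latent feature-prediction loss over masked tubelets, $\lVert p_\psi(z_{\text{ctx}},m)-\operatorname{sg}[\bar f_\theta(x)]\rVert^2$, where $p_\psi$ is the mask-conditioned pretraining predictor (distinct from the action-conditioned $g_\phi$ below), $z_{\text{ctx}}$ is the encoding of the visible context, $m$ marks the masked positions, $\bar f_\theta$ is the EMA target encoder, and $\operatorname{sg}$ denotes stop-gradient. The action-conditioned predictor is trained to forecast the next latent given the current latent and action,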
$$\mathcal{L}_{\text{AC}} = \big\lVert\, g_\phi(z_t, a_t) - \operatorname{sg}[f_\theta(o_{t+1})] \,\big\rVert^2,$$
and planning solves $\min_{a_{t:t+H-1}} \; \lVert \hat z_{t+H} - z^* \rVert$ over a short horizon $H$, where $\hat z_{t+H}$ is the latent reached by applying $g_\phi$ to the $H$ actions in turn.
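Written out, and assuming each planned action is applied through the same one-step predictor, the imagined trajectory behind this objective is the recursion

$$\hat z_{t} = f_\theta(o_t), \qquad \hat z_{t+k+1} = g_\phi(\hat z_{t+k},\, a_{t+k}), \quad k = 0,\dots,H-1,$$

so $\hat z_{t+H}$ is a function of the current observation and the $H$ actions $a_t,\dots,a_{t+H-1}$. For $H=2$, for example, $\hat z_{t+2} = g_\phi\big(g_\phi(f_\theta(o_t), a_t),\, a_{t+1}\big)$. Because predictions are fed back into the predictor, one-step errors can compound over the rollout, which is one reason the horizon is kept short.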
Key results & what's novel
V-JEPA 2 reports state-of-the-art results on motion understanding (e.g. Something-Something v2) and human action anticipation (e.g. Epic-Kitchens). Most importantly, its action-conditioned variant performs zero-shot robot manipulation (reach, grasp, pick-and-place) on a real arm by planning in latent space, using a frozen pretrained encoder and only modest interaction data. The contribution is the demonstration that a self-supervised video JEPA can be cheaply converted into a planning-capable world model.
Strengths & limitations
- + Internet-scale pretraining transfers to control with a frozen encoder.
- + Zero-shot planning without task rewards or demonstrations.
- − Planning horizons are short; long-horizon control needs hierarchy (cf. hierarchical latent world models).
- − Goals must be specified as latent embeddings; the predictor is deterministic, so it does not represent uncertainty over futures.
- − Inference-time planning (sampling + rollouts) adds compute at deployment.