What Drives Success in Physical Planning with JEPA World Models?

At a glance

Problem: Many design choices go into a JEPA world model for control, and it is unclear which ones actually drive successful planning rather than reconstruction quality.

Key idea: Hold the JEPA-for-planning pipeline fixed and ablate components one at a time across diverse control benchmarks to isolate what causes good planning.

Modality: Vision (+ robot/agent actions)

Target / masking: Spatiotemporal masking with a target encoder (typically EMA); varied as one of the studied factors.

Builds on: V-JEPA 2 and the action-conditioned latent-prediction recipe for planning.

Used for: Design guidance for building plannable JEPA world models; benchmarked on DROID, Metaworld, and Push-T.

Motivation

A JEPA world model for control bundles many decisions: encoder size and architecture, the masking scheme, how actions are injected, the prediction horizon, the anti-collapse mechanism, and the planning solver at test time. When such a model plans well, it is hard to say why, and when it fails, harder still to say which knob to turn. The folklore that bigger encoders or lower latent-prediction loss imply better control is largely untested. This study sets out to replace that folklore with controlled evidence, asking which design factors causally drive planning return, as opposed to merely improving reconstruction or representation metrics that do not transfer to action.

How it works

Canonical JEPA schematic for video. The input is split into a visible context and hidden targets (tubelet-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ predicts the target embeddings from the context embeddings; training minimises the distance between predicted and target embeddings in latent space. For planning, the predictor is action-conditioned, $\hat z_{t+1}=g_\phi(z_t,a_t)$, which is what turns a representation learner into a world model.

The backbone is the standard recipe. A context encoder embeds observations into a latent $z_t$; a predictor advances the latent under actions, $\hat z_{t+1}=g_\phi(z_t,a_t)$; training minimises a latent prediction loss against a target encoder, with an explicit anti-collapse term. At deployment, model-predictive control samples candidate action sequences, rolls latents forward through $g_\phi$, and scores each rollout against a goal embedding.
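The planning loop described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the additive `predictor` is a toy stand-in for $g_\phi$, random shooting is an assumed solver (the text only says candidate sequences are sampled), and every name and parameter here (`plan_random_shooting`, `n_candidates`, `max_step`) is hypothetical.

```python
import random

def predictor(z, a):
    """Toy stand-in for g_phi: the action shifts the latent additively."""
    return [zi + ai for zi, ai in zip(z, a)]

def rollout(z0, actions):
    """Roll the latent forward through the predictor, one action per step."""
    z = z0
    for a in actions:
        z = predictor(z, a)
    return z

def goal_cost(z, z_goal):
    """Squared Euclidean distance between a latent and the goal embedding."""
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))

def plan_random_shooting(z0, z_goal, horizon=5, n_candidates=256,
                         max_step=0.5, seed=0):
    """Sample action sequences, score each rollout, keep the cheapest one."""
    rng = random.Random(seed)
    dim = len(z0)
    # Start from the all-zero sequence so the plan is never worse than idling.
    best_actions = [[0.0] * dim for _ in range(horizon)]
    best_cost = goal_cost(rollout(z0, best_actions), z_goal)
    for _ in range(n_candidates):
        actions = [[rng.uniform(-max_step, max_step) for _ in range(dim)]
                   for _ in range(horizon)]
        cost = goal_cost(rollout(z0, actions), z_goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

z0 = [0.0, 0.0]          # current latent (assumed already encoded)
z_goal = [1.0, -1.0]     # goal embedding (assumed already encoded)
plan, cost = plan_random_shooting(z0, z_goal)
first_action = plan[0]   # MPC executes only this action, then replans
```

In a real system the latents would come from the context encoder and the predictor would be the trained network; only the sample, roll out, score structure carries over.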

The study then varies one component at a time — encoder, masking, action conditioning, horizon, anti-collapse, planner — while holding the rest fixed, and measures planning performance across three benchmarks: DROID (real robot manipulation), Metaworld (simulated manipulation), and Push-T (contact-rich planar pushing). This separates representation-learning quality from planner competence.
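One way to picture the one-factor-at-a-time protocol is as a set of configurations that each differ from a baseline in exactly one entry. The sketch below is illustrative only: the factor names follow the list above, but every value (`"vit_base"`, `"concat"`, and so on) is a hypothetical placeholder rather than a setting reported by the study.

```python
# Hypothetical baseline; values are placeholders, not the study's settings.
BASELINE = {
    "encoder": "vit_base",
    "masking": "multi_block",
    "action_conditioning": "concat",
    "horizon": 1,
    "anti_collapse": "ema",
    "planner": "random_shooting",
}

# Hypothetical alternatives to try for each factor.
ALTERNATIVES = {
    "encoder": ["vit_small", "vit_large"],
    "masking": ["random_tube"],
    "action_conditioning": ["cross_attention"],
    "horizon": [2, 4],
    "anti_collapse": ["variance_regulariser"],
    "planner": ["cem"],
}

def single_factor_variants(baseline, alternatives):
    """Return (factor, config) pairs that change exactly one factor each."""
    variants = []
    for factor, values in alternatives.items():
        for value in values:
            if value == baseline[factor]:
                continue  # skip no-op variants identical to the baseline
            config = dict(baseline)  # copy so the baseline stays untouched
            config[factor] = value
            variants.append((factor, config))
    return variants

variants = single_factor_variants(BASELINE, ALTERNATIVES)
for factor, config in variants:
    print(factor, "->", config[factor])
```

Because every variant shares all other settings with the baseline, a change in planning success can be attributed to the single factor that moved, at the cost of not covering interactions between factors.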

The objective

Each model is trained with the action-conditioned latent-prediction loss

$$\mathcal{L} = \big\lVert\, g_\phi(z_t, a_t) - \operatorname{sg}[\bar f_\theta(o_{t+1})] \,\big\rVert^2 \;+\; \lambda\,\mathcal{R}(Z),$$

where $\bar f_\theta$ is the (possibly EMA) target encoder, $\operatorname{sg}$ is stop-gradient, and $\mathcal{R}$ is the anti-collapse regulariser on the latents, weighted by $\lambda$. At test time the planner solves $\min_{a_{t:t+H-1}} \lVert \hat z_{t+H} - z^* \rVert$ over horizon $H$, where $\hat z_{t+H}$ is obtained by applying $g_\phi$ for $H$ steps from $z_t$ under the candidate actions and $z^*$ is the goal embedding. The experimental variable is which piece of this pipeline is changed; the metric is downstream planning success, decoupled from the value of the training loss itself.
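To make the two terms of the loss concrete, here is a value-only sketch on small hand-written batches. The regulariser is an assumption: the text leaves $\mathcal{R}$ unspecified, so a variance hinge (in the style of VICReg) is swapped in purely for illustration, and `gamma`, `eps`, and `lam` are hypothetical parameters. Stop-gradient has no effect here because no gradients are computed; the targets are simply treated as constants.

```python
import math

def mean_squared_distance(pred, target):
    """Batch mean of squared L2 distance between predicted and target latents."""
    total = 0.0
    for p, t in zip(pred, target):
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(pred)

def variance_hinge(latents, gamma=1.0, eps=1e-4):
    """Assumed R(Z): penalise dimensions whose batch std falls below gamma."""
    n, dim = len(latents), len(latents[0])
    penalty = 0.0
    for d in range(dim):
        column = [z[d] for z in latents]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        penalty += max(0.0, gamma - math.sqrt(var + eps))
    return penalty / dim

def jepa_loss(pred, target, latents, lam=0.1):
    """Latent prediction error plus lam times the anti-collapse penalty."""
    return mean_squared_distance(pred, target) + lam * variance_hinge(latents)

spread = [[-2.0, 2.0], [2.0, -2.0]]      # latents with healthy variance
collapsed = [[1.0, 1.0], [1.0, 1.0]]     # every sample maps to the same point

# A collapsed model predicts its own constant targets perfectly, so only
# the regulariser distinguishes it from a non-degenerate solution.
loss_spread = jepa_loss(spread, spread, spread)
loss_collapsed = jepa_loss(collapsed, collapsed, collapsed)
```

The comparison shows why the prediction term alone is not enough: a constant encoder drives it to zero, and only the extra penalty makes the collapsed solution more expensive.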

Key results & what's novel

The contribution is a controlled account of what drives success in physical planning with JEPA world models. Rather than proposing a new architecture, it maps which representation properties — temporal consistency, faithful action-grounding, smooth and well-conditioned latent transitions — correlate with planning return, and where naive scaling or lower reconstruction loss fails to help. The recurring lesson is that planning competence hinges on accurate, plannable latent transitions under actions, not on visual fidelity or generic representation-quality metrics. The work supplies design guidance and released models, turning JEPA-for-planning from a collection of empirical tricks into something closer to a recipe.

Strengths & limitations

+ Controlled, single-factor ablations across three distinct control benchmarks.
+ Cleanly separates representation quality from planner competence.
+ Yields actionable design guidance and reusable models.
− Conclusions are bounded by the studied benchmarks and the chosen design axes; some interactions between factors may not generalise.
− As an empirical study it offers heuristics rather than theory about why a given factor matters.

Connections & references

Builds on: V-JEPA 2, V-JEPA
