Motivation
A JEPA world model for control bundles many decisions: encoder size and architecture, the masking scheme, how actions are injected, the prediction horizon, the anti-collapse mechanism, and the planning solver at test time. When such a model plans well, it is genuinely hard to say why — and when it fails, harder still to say which knob to turn. The folklore that bigger encoders or lower latent-prediction loss imply better control is largely untested. This study sets out to replace that folklore with controlled evidence, asking which design factors causally drive planning return as opposed to merely improving reconstruction or representation metrics that do not transfer to action.
How it works
The backbone is the standard recipe. A context encoder embeds observations into a latent $z_t$; a predictor advances the latent under actions, $\hat z_{t+1}=g_\phi(z_t,a_t)$; training minimises a latent prediction loss against the output of a target encoder, with an explicit anti-collapse term. At deployment, model-predictive control samples candidate action sequences, rolls latents forward through $g_\phi$, and scores each rollout by its distance to a goal embedding $z^*$.
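The deployment loop described above can be sketched as a random-shooting planner. This is a minimal illustration, not the study's solver: `predictor` is a hypothetical additive stand-in for $g_\phi$, and the latent dimension, horizon, sample count, and action range are all assumed values chosen only to make the example run.

```python
import random

def predictor(z, a):
    # Hypothetical stand-in for g_phi: a toy additive latent transition.
    return [zi + ai for zi, ai in zip(z, a)]

def rollout(z0, actions):
    # Roll the latent forward through the predictor, one action per step.
    z = z0
    for a in actions:
        z = predictor(z, a)
    return z

def cost(z, z_goal):
    # Squared Euclidean distance between a latent and the goal embedding.
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))

def plan(z0, z_goal, horizon=5, num_samples=256, action_scale=0.5, seed=0):
    # Random shooting: sample action sequences and keep the lowest-cost one.
    rng = random.Random(seed)
    dim = len(z0)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_samples):
        actions = [
            [rng.uniform(-action_scale, action_scale) for _ in range(dim)]
            for _ in range(horizon)
        ]
        c = cost(rollout(z0, actions), z_goal)
        if c < best_cost:
            best_actions, best_cost = actions, c
    return best_actions, best_cost

z0 = [0.0, 0.0]        # assumed current latent
z_goal = [1.0, -1.0]   # assumed goal embedding z*
best_actions, best_cost = plan(z0, z_goal)
first_action = best_actions[0]  # MPC executes the first action, then replans
```

In a receding-horizon setting only `first_action` would be executed before replanning from the next observation. A stronger sampler such as the cross-entropy method could replace the uniform sampling without changing the rollout-and-score structure.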
The study then varies one component at a time (encoder, masking, action conditioning, horizon, anti-collapse, or planner) while holding the rest fixed, and measures planning performance on three benchmarks: DROID (real robot manipulation), Metaworld (simulated manipulation), and Push-T (contact-rich planar pushing). Because only one component changes per comparison, a change in planning performance can be attributed to that component, which is what separates representation-learning quality from planner competence.
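A one-factor-at-a-time design like this can be enumerated mechanically. The sketch below is only an illustration of the protocol: the factor names and benchmark names follow the text, but every option value (for example `"vit_small"` or `"cem"`) is a hypothetical placeholder, not a setting reported by the study.

```python
# Hypothetical baseline configuration; option names are illustrative only.
BASELINE = {
    "encoder": "vit_small",
    "masking": "block",
    "action_conditioning": "concat",
    "horizon": 1,
    "anti_collapse": "ema",
    "planner": "random_shooting",
}

# Hypothetical alternatives for each factor (the baseline value is included).
ALTERNATIVES = {
    "encoder": ["vit_small", "vit_base"],
    "masking": ["block", "random"],
    "action_conditioning": ["concat", "film"],
    "horizon": [1, 4],
    "anti_collapse": ["ema", "variance_reg"],
    "planner": ["random_shooting", "cem"],
}

BENCHMARKS = ["DROID", "Metaworld", "Push-T"]

def single_factor_configs(baseline, alternatives):
    # Start from the baseline, then change exactly one factor per config.
    configs = [("baseline", dict(baseline))]
    for factor, options in alternatives.items():
        for option in options:
            if option == baseline[factor]:
                continue  # skip the value already covered by the baseline
            cfg = dict(baseline)
            cfg[factor] = option
            configs.append((f"{factor}={option}", cfg))
    return configs

configs = single_factor_configs(BASELINE, ALTERNATIVES)
# Every configuration is evaluated on every benchmark.
runs = [(name, cfg, bench) for name, cfg in configs for bench in BENCHMARKS]
```

Each non-baseline configuration differs from the baseline in exactly one key, so comparing a run with the baseline on the same benchmark isolates that factor. The design does not cover joint changes of two or more factors.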
The objective
Each model is trained with the action-conditioned latent-prediction loss
$$\mathcal{L} = \big\lVert\, g_\phi(z_t, a_t) - \operatorname{sg}[\bar f_\theta(o_{t+1})] \,\big\rVert^2 \;+\; \lambda\,\mathcal{R}(Z),$$
where $\bar f_\theta$ is the (possibly EMA) target encoder, $\operatorname{sg}$ is stop-gradient, $\mathcal{R}$ is the anti-collapse regulariser applied to a batch of latents $Z$, and $\lambda$ sets its weight. At test time the planner solves $\min_{a_{t:t+H-1}} \lVert \hat z_{t+H} - z^* \rVert$ over horizon $H$, where $\hat z_{t+H}$ is obtained by applying $g_\phi$ recursively from $z_t$ and $z^*$ is the goal embedding. The experimental variable is which piece of this pipeline is changed; the metric is downstream planning success, decoupled from the value of the training loss itself.
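To make the two terms of the loss concrete, here is a minimal numerical sketch on plain Python lists. Two assumptions are loud ones: the text does not specify $\mathcal{R}$, so a variance-hinge penalty (in the style of VICReg) is swapped in purely as an example, and there is no autodiff here, so stop-gradient is represented only by treating `target` as a fixed input. The function names and the `gamma`, `eps`, and `lam` defaults are hypothetical.

```python
import math

def prediction_loss(pred, target):
    # Mean over the batch of the squared L2 distance between predicted
    # latents and target-encoder latents (target is treated as a constant).
    per_sample = [
        sum((p - t) ** 2 for p, t in zip(ps, ts))
        for ps, ts in zip(pred, target)
    ]
    return sum(per_sample) / len(per_sample)

def variance_regulariser(latents, gamma=1.0, eps=1e-4):
    # Assumed stand-in for R(Z): penalise latent dimensions whose standard
    # deviation across the batch falls below gamma.
    n, dim = len(latents), len(latents[0])
    penalty = 0.0
    for d in range(dim):
        column = [z[d] for z in latents]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        penalty += max(0.0, gamma - math.sqrt(var + eps))
    return penalty / dim

def total_loss(pred, target, latents, lam=1.0):
    # L = prediction term + lambda * anti-collapse term.
    return prediction_loss(pred, target) + lam * variance_regulariser(latents)

collapsed = [[0.5, 0.5], [0.5, 0.5]]   # every sample maps to the same latent
spread = [[2.0, -2.0], [-2.0, 2.0]]    # latents vary across the batch

loss_collapsed = total_loss(collapsed, collapsed, collapsed)
loss_spread = total_loss(spread, spread, spread)
```

In both cases the prediction term is zero because the prediction matches the target exactly, yet the collapsed batch still incurs a penalty. That is the failure mode the regulariser exists to rule out: a constant encoder makes prediction trivially perfect.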
Key results & what's novel
The contribution is a controlled account of what drives success in physical planning with JEPA world models. Rather than proposing a new architecture, it maps which representation properties — temporal consistency, faithful action-grounding, smooth and well-conditioned latent transitions — correlate with planning return, and where naive scaling or lower reconstruction loss fails to help. The recurring lesson is that planning competence hinges on accurate, plannable latent transitions under actions, not on visual fidelity or generic representation-quality metrics. The work supplies design guidance and released models, turning JEPA-for-planning from a collection of empirical tricks into something closer to a recipe.
Strengths & limitations
- + Controlled, single-factor ablations across three distinct control benchmarks.
- + Cleanly separates representation quality from planner competence.
- + Yields actionable design guidance and reusable models.
- − Conclusions are bounded by the studied benchmarks and the chosen design axes; some interactions between factors may not generalise.
- − As an empirical study it offers heuristics rather than theory about why a given factor matters.