At a glance
Problem: Invariances that help recognition can discard the state information a planner needs; it is unclear which invariant visual representations actually benefit planning.
Key idea: Shape and probe the encoder's invariances and measure their effect on model-predictive planning, exposing an invariance-for-planning trade-off.
Modality: Vision (+ actions)
Target / masking: Standard JEPA latent prediction against a target encoder; invariances induced and varied as the study variable.
Builds on: JEPA action-conditioned world models and self-supervised invariance learning.
Used for: Guidance on building encoders whose invariances retain control-relevant structure.

Motivation

Self-supervised encoders are often pushed to be invariant to nuisance factors — appearance, viewpoint, background — because invariance aids recognition and robustness. But control is not recognition. A planner must distinguish configurations that look similar yet behave differently, and an over-invariant encoder erases exactly the state information needed to do so. The same invariance that makes a representation transfer across scenes can make its latent transitions unplannable. This work asks, concretely, which invariant visual representations help planning with JEPA world models and which hurt it.

How it works

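[Figure: JEPA schematic. Video input (tubelets, multi-block) → context encoder $f_\theta$ → $z_{\text{ctx}}$ → predictor $g_\phi$, conditioned on action $a_t$; target encoder $\bar f_\theta$ (EMA copy) → $\bar z$ (stop-gradient); latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.]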
Canonical JEPA schematic for video. The input is split into a visible context and hidden targets (tubelet-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the latent distance between prediction and target. The predictor is action-conditioned, $\hat z_{t+1}=g_\phi(z_t,a_t)$, which is what turns a representation learner into a world model.
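The "EMA copy" in the schematic can be illustrated with a minimal NumPy sketch. The dictionary-of-arrays parameter layout, the function name `ema_update`, and the decay value `tau` are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """Return target parameters moved a small step toward the online ones.

    target <- tau * target + (1 - tau) * online, applied per parameter.
    No gradient ever flows into the target encoder; it only tracks the
    context (online) encoder through this update.
    """
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

# Toy parameters standing in for f_theta (online) and f_bar_theta (target).
online = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

target = ema_update(target, online, tau=0.9)
print(target["w"])  # every entry has moved 10% of the way from 0 toward 1
```

A decay close to 1 makes the target change slowly, which is the usual reason for using an EMA target rather than a direct copy.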

Within the JEPA template (a context encoder, a predictor that advances latents under actions, $\hat z_{t+1}=g_\phi(z_t,a_t)$, and a latent prediction loss against a target encoder with an anti-collapse term), the authors deliberately shape and probe the representation's invariances. They induce invariance to chosen factors (e.g. appearance, viewpoint, task-irrelevant background) and then measure the downstream effect on model-predictive planning, in which candidate action sequences are rolled out in latent space and scored against a goal latent.
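To make the planning step concrete, here is a minimal sketch of latent-space planning by random shooting, one simple form of model-predictive control. The choice of random shooting, the squared-distance cost, the uniform action range, and the names `plan_random_shooting` and `toy_predictor` are all assumptions for illustration; the source does not say which planner or cost the authors use.

```python
import numpy as np

def plan_random_shooting(z0, z_goal, predictor, horizon=5,
                         n_candidates=256, action_dim=2, rng=None):
    """Pick the first action of the best sampled action sequence.

    predictor(z, a) plays the role of g_phi: it maps a batch of latents
    and a batch of actions to the next latents.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Sample candidate action sequences uniformly in [-1, 1] (assumed range).
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    # Roll every candidate forward in latent space from the same start.
    z = np.repeat(z0[None, :], n_candidates, axis=0)
    for t in range(horizon):
        z = predictor(z, actions[:, t])
    # Score each rollout by squared distance of its final latent to the goal.
    costs = np.sum((z - z_goal[None, :]) ** 2, axis=1)
    best = int(np.argmin(costs))
    # MPC executes only the first action, then re-plans from the new latent.
    return actions[best, 0], float(costs[best])

def toy_predictor(z, a):
    # Hypothetical stand-in for a learned g_phi: the latent shifts by 0.1 * a.
    return z + 0.1 * a

z0 = np.zeros(2)
z_goal = np.array([0.2, -0.1])
first_action, best_cost = plan_random_shooting(z0, z_goal, toy_predictor)
print(first_action, best_cost)
```

The sketch also shows why over-invariance hurts: if the encoder maps two control-relevant states to the same latent, their costs against the goal latent are identical and the planner cannot prefer one over the other.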

The recurring finding is a tension: invariance improves robustness and sample efficiency, but over-invariance removes the actionable state distinctions a planner relies on, degrading control. The study characterises where on this spectrum planning performance peaks for different nuisance factors.

The objective

The world model is trained with the usual action-conditioned latent loss

$$\mathcal{L} = \big\lVert g_\phi(z_t, a_t) - \operatorname{sg}[\bar f_\theta(o_{t+1})] \big\rVert^2 + \lambda\,\mathcal{R}(Z),$$

where $\operatorname{sg}[\cdot]$ is the stop-gradient and $\mathcal{R}(Z)$, weighted by $\lambda$, is the anti-collapse regulariser. The loss is supplemented by penalty terms or data augmentations that enforce invariance to chosen transformations $T$, e.g. by encouraging $f_\theta(o)\approx f_\theta(T\,o)$. The experimental variable is which invariances are imposed and how strongly; the metric is planning return under model-predictive control. The desired regime has enough invariance to suppress distractors yet enough equivariance that the latent transition $g_\phi$ remains faithful and plannable.
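One way such an invariance term could be combined with the prediction loss is sketched below. The explicit squared-error penalty, its weight `beta`, the linear toy encoder, and the additive offset standing in for $T$ are all assumptions; the source only says that invariance is encouraged, not how. The regulariser $\mathcal{R}(Z)$ is omitted to keep the sketch short.

```python
import numpy as np

def invariance_augmented_loss(z_pred, z_next_target, z_clean, z_transformed,
                              beta=1.0):
    """Prediction loss plus beta times an invariance penalty (batch means)."""
    # Prediction term: || g_phi(z_t, a_t) - sg[f_bar(o_{t+1})] ||^2.
    # NumPy has no autograd, so the target is a constant by construction;
    # in a real framework it would need an explicit stop-gradient.
    pred = np.mean(np.sum((z_pred - z_next_target) ** 2, axis=1))
    # Invariance term: || f(o) - f(T o) ||^2 for the chosen transformation T.
    inv = np.mean(np.sum((z_clean - z_transformed) ** 2, axis=1))
    return pred + beta * inv, pred, inv

# Toy usage with hypothetical stand-ins for the encoder and transformation.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # weights of a toy linear encoder f(o) = o @ W
obs = rng.normal(size=(8, 4))    # batch of 8 toy observations
obs_T = obs + 0.5                # toy nuisance transform T (constant offset)
z_clean, z_T = obs @ W, obs_T @ W
z_pred = z_clean + 0.1           # stand-in for the predictor output
z_next = z_clean                 # stand-in for the target embedding
total, pred, inv = invariance_augmented_loss(z_pred, z_next, z_clean, z_T,
                                             beta=0.5)
print(total, pred, inv)
```

In this framing `beta` is the knob for "how strongly": `beta = 0` recovers the plain prediction loss, while a large `beta` pushes $f_\theta(o)$ and $f_\theta(T\,o)$ together regardless of whether $T$ changes control-relevant state.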

Key results & what's novel

The contribution is a principled characterisation of the invariance-for-planning trade-off in JEPA world models. Rather than assuming more invariance is better, the study identifies which invariances retain control-relevant structure and which destroy it, giving guidance for building encoders that are invariant enough to generalise across scenes yet equivariant enough that latent transitions stay plannable. It refines a central design lever for JEPA-based control, connecting representation-learning choices directly to downstream planning performance.

Strengths & limitations

  • + Makes an often-ignored trade-off explicit and measurable for planning.
  • + Yields actionable guidance on which invariances to seek and which to avoid.
  • + Bridges self-supervised representation design and model-based control.
  • − Findings depend on the specific transformations and environments studied.
  • − The right balance is likely task- and domain-specific, not a single universal setting.
  • − Inducing targeted invariances may require augmentations or supervision the base JEPA otherwise avoids.

Connections & references