At a glance
Problem: JEPA pretraining is sometimes augmented with auxiliary objectives, but it was unclear when these help versus hurt.
Key idea: Systematically characterise why auxiliary tasks improve JEPA representations and how to choose and weight them.
Modality: Analysis (representation learning)
Target / masking: Masked latent prediction against an EMA target encoder, plus complementary auxiliary objectives.
Builds on: I-JEPA / V-JEPA masked-prediction pretraining and anti-collapse regularisation.
Used for: Guiding the design of multi-objective JEPA pretraining for richer latent state spaces.

Motivation

The core JEPA loss trains a context encoder to predict the EMA target encoder's embeddings of masked regions. Practitioners often add auxiliary tasks (extra predictive or reconstructive objectives), but the effect is inconsistent: sometimes they sharpen representations, sometimes they dilute the primary signal. "Why and How Auxiliary Tasks Improve JEPA Representations" (Yu et al., 2025) studies this systematically, aiming to replace folklore with an account of the conditions under which auxiliaries yield better downstream representations.

How it works

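[Figure: JEPA schematic. Image / video tokens (blocks) feed the context encoder f_θ, producing z_ctx; the target encoder f̄_θ (EMA copy) produces z̄ under stop-gradient (sg); the predictor g_φ outputs ẑ and is trained with the latent loss ‖ẑ − sg(z̄)‖².]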
Canonical JEPA schematic for image / video inputs. The input is split into a visible context and hidden targets (token-level, blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to a prediction $\hat z$ of the target embeddings; training minimises the latent distance between that prediction and the target embeddings.
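The two moving parts of the schematic, the latent loss and the EMA target update, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in any JEPA paper: the function names, the tensor shapes, and the momentum value of 0.996 are assumptions chosen for the example, and the stop-gradient is represented only by treating the target embeddings as constants.

```python
import numpy as np

def jepa_latent_loss(z_pred, z_target):
    """Mean over tokens of the squared L2 distance ||z_pred - z_target||^2.

    z_target plays the role of sg(z_bar): it is treated as a constant here.
    In an autodiff framework it would be detached before this call.
    """
    diff = z_pred - z_target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

def ema_update(target_params, context_params, momentum=0.996):
    """Move each target-encoder parameter toward its context-encoder counterpart.

    The momentum value is an illustrative assumption, not taken from the source.
    """
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]

# Hypothetical shapes: 4 masked target tokens, embedding dimension 8.
rng = np.random.default_rng(0)
z_pred = rng.normal(size=(4, 8))     # predictor output
z_target = rng.normal(size=(4, 8))   # target-encoder embeddings, held fixed
loss = jepa_latent_loss(z_pred, z_target)

# One EMA step on a single toy parameter tensor.
context_params = [np.ones((2, 2))]
target_params = [np.zeros((2, 2))]
target_params = ema_update(target_params, context_params, momentum=0.9)
```

Because the target encoder is updated only through the moving average, it changes slowly relative to the context encoder, which is what lets it serve as a stable prediction target.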

The study treats the JEPA loss as the primary signal and analyses how additional objectives interact with it. Masked latent prediction is indifferent to certain factors of variation; an auxiliary task can supply gradient for exactly those factors, enriching what the encoder captures. Auxiliaries can also act as extra anti-collapse pressure or reshape the geometry of the embedding space. The decisive variable is alignment between the auxiliary task and the structure of the target representation: well-aligned tasks add complementary, non-redundant signal, whereas misaligned tasks compete with the primary objective and distort the learned features.

The analysis

Training optimises a weighted combination of the primary and auxiliary losses,

$$\mathcal{L} = \mathcal{L}_{\text{JEPA}} \;+\; \sum_k \beta_k\,\mathcal{L}^{\text{aux}}_k,$$

and the analysis examines when adding $\mathcal{L}^{\text{aux}}_k$ improves the downstream-relevant subspace of the representation. The benefit grows when the auxiliary gradient is positively aligned with directions that the masked-prediction loss leaves underdetermined, and turns harmful when it conflicts with the primary objective. This gives both an explanation and prescriptive guidance on selecting auxiliary tasks and setting their weights $\beta_k$.
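To make the alignment argument concrete, the following toy sketch combines losses with weights $\beta_k$ and compares two hypothetical auxiliary tasks against a primary loss that constrains only one of two parameters. Everything here is an assumption made for illustration: the quadratic losses, the helper names, and the two diagnostics (cosine with the primary gradient, and the share of the auxiliary gradient lying in the underdetermined direction) are not the paper's own metrics.

```python
import numpy as np

def total_loss(l_jepa, aux_losses, betas):
    """L = L_JEPA + sum_k beta_k * L_aux_k."""
    return l_jepa + sum(b * l for b, l in zip(betas, aux_losses))

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def underdetermined_fraction(g, null_basis, eps=1e-12):
    """Share of g's norm lying in the subspace spanned by the rows of null_basis.

    null_basis must have orthonormal rows; it stands for the directions
    the primary loss leaves unconstrained.
    """
    proj = null_basis.T @ (null_basis @ g)
    return float(np.linalg.norm(proj) / (np.linalg.norm(g) + eps))

# Toy parameters w in R^2.
# Primary loss 0.5 * (w0 - 1)^2 constrains w0 only, so w1 is underdetermined.
def grad_primary(w):
    return np.array([w[0] - 1.0, 0.0])

# Auxiliary A: 0.5 * (w1 - 2)^2 supplies gradient only along w1 (complementary).
def grad_aux_complementary(w):
    return np.array([0.0, w[1] - 2.0])

# Auxiliary B: 0.5 * (w0 + 1)^2 pulls w0 toward -1 (conflicts with the primary).
def grad_aux_conflicting(w):
    return np.array([w[0] + 1.0, 0.0])

w = np.zeros(2)
null_basis = np.array([[0.0, 1.0]])   # the w1 axis

g_p = grad_primary(w)
g_a = grad_aux_complementary(w)
g_b = grad_aux_conflicting(w)

cos_a, cos_b = cosine(g_p, g_a), cosine(g_p, g_b)
frac_a = underdetermined_fraction(g_a, null_basis)
frac_b = underdetermined_fraction(g_b, null_basis)

# Weighted combination for two auxiliary losses with weights beta_k.
combined = total_loss(1.0, [2.0, 4.0], [0.5, 0.25])
```

In this toy setting, auxiliary A is orthogonal to the primary gradient and lies entirely in the unconstrained direction, so it adds signal without competing. Auxiliary B points directly against the primary gradient and contributes nothing to the unconstrained direction, which is the case where a small or zero $\beta_k$ would be preferable.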

Why it matters

World models used in practice must encode many factors of variation, and pure masked prediction can leave controllable, dynamics-relevant information underspecified. By clarifying when auxiliaries supply complementary signal and how to weight them, the paper offers a recipe for steering JEPA latents toward the information a world model needs for prediction and planning, rather than treating auxiliary objectives as an unprincipled add-on.

Strengths & limitations

  • + Replaces trial-and-error with an alignment-based account of auxiliary benefit.
  • + Offers actionable guidance on task choice and loss weighting.
  • + Connects auxiliary design to the structure of the target representation.
  • − "Alignment" can be hard to measure a priori for a novel task.
  • − Optimal weights remain partly empirical and dataset-dependent.

Connections & references

Builds on: I-JEPA, V-JEPA