Motivation
The core JEPA loss trains a context encoder to predict the embeddings that an exponential-moving-average (EMA) target encoder assigns to masked regions. Practitioners often add auxiliary tasks (extra predictive or reconstructive objectives), but the effect is inconsistent: sometimes they sharpen representations, sometimes they dilute the primary signal. The paper "Why and How Auxiliary Tasks Improve JEPA Representations" (Yu et al., 2025) studies this systematically, aiming to replace folklore with an account of the conditions under which auxiliaries yield better downstream representations.
How it works
The study treats the JEPA loss as the primary signal and analyses how additional objectives interact with it. Masked latent prediction is indifferent to certain factors of variation; an auxiliary task can supply gradient for exactly those factors, enriching what the encoder captures. Auxiliaries can also act as extra anti-collapse pressure or reshape the geometry of the embedding space. The decisive variable is alignment between the auxiliary task and the structure of the target representation: well-aligned tasks add complementary, non-redundant signal, whereas misaligned tasks compete with the primary objective and distort the learned features.
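As a rough illustration of the primary signal that the auxiliaries are measured against, the following NumPy sketch shows masked latent prediction against an EMA target. Everything in it is an assumption made for illustration rather than the paper's implementation: the linear encoders, the mean-pooled context, the position-agnostic predictor, the momentum value, and the names `jepa_loss` and `ema_update` are all hypothetical.

```python
import numpy as np

def ema_update(target_w, context_w, momentum=0.99):
    """Exponential moving average: the target encoder slowly tracks the context encoder."""
    return momentum * target_w + (1.0 - momentum) * context_w

def jepa_loss(context_w, target_w, predictor_w, patches, masked):
    """Toy masked latent prediction loss with linear modules (illustrative only).

    patches: array of shape (num_patches, patch_dim)
    masked:  boolean array of shape (num_patches,), True where a patch is hidden
    """
    # The context branch sees only visible patches and pools them into one summary vector.
    context_summary = (patches[~masked] @ context_w).mean(axis=0)
    # The predictor maps the summary to a guess for the masked embeddings.
    # Simplification: no positional conditioning, so every masked patch gets the same guess.
    prediction = context_summary @ predictor_w
    # The target branch embeds the masked patches; in real training it receives no gradient.
    targets = patches[masked] @ target_w
    return float(np.mean((targets - prediction) ** 2))

rng = np.random.default_rng(0)
patch_dim, embed_dim = 8, 4
patches = rng.normal(size=(6, patch_dim))
masked = np.array([False, False, True, False, True, False])

context_w = rng.normal(size=(patch_dim, embed_dim))
target_w = context_w.copy()          # target starts as a copy of the context encoder
predictor_w = np.eye(embed_dim)

loss = jepa_loss(context_w, target_w, predictor_w, patches, masked)

# Pretend an optimiser step shifted every context weight by 1.0, then refresh the target.
new_target_w = ema_update(target_w, context_w + 1.0, momentum=0.99)
```

With a momentum of 0.99 the target moves only 1% of the way toward the updated context encoder per step, which is what keeps the prediction targets slowly varying.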
The analysis
Training optimises a weighted combination of the primary and auxiliary losses,
$$\mathcal{L} = \mathcal{L}_{\text{JEPA}} \;+\; \sum_k \beta_k\,\mathcal{L}^{\text{aux}}_k,$$
and the analysis examines when adding $\mathcal{L}^{\text{aux}}_k$ improves the downstream-relevant subspace of the representation. The benefit grows when the auxiliary gradient is positively aligned with directions that the masked-prediction loss leaves underdetermined, and turns harmful when it conflicts with them. This gives both an explanation and prescriptive guidance on which auxiliary tasks to select and how to set their weights $\beta_k$.
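One way to make the alignment idea concrete is a small gradient diagnostic. The sketch below is hypothetical and not the measure used in the paper, which this summary does not specify. It assumes that "underdetermined" directions can be approximated by the near-flat eigen-directions of the primary loss Hessian, and that conflict can be read off the cosine between auxiliary and primary gradients. The function names and the three-parameter toy problem are invented for illustration.

```python
import numpy as np

def total_loss(primary, aux_losses, betas):
    """Weighted objective: L = L_JEPA + sum_k beta_k * L_aux_k."""
    return primary + sum(b * l for b, l in zip(betas, aux_losses))

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def underdetermined_fraction(aux_grad, primary_hessian, tol=1e-8):
    """Share of the auxiliary gradient's squared norm that lies in directions
    where the primary loss is (near-)flat. Assumption: flat Hessian directions
    stand in for 'underdetermined' directions."""
    eigvals, eigvecs = np.linalg.eigh(primary_hessian)
    flat = eigvecs[:, np.abs(eigvals) < tol]      # basis of the near-null space
    projected = flat @ (flat.T @ aux_grad)
    return float(projected @ projected / (aux_grad @ aux_grad))

# Toy primary loss 0.5 * theta^T A theta: the third parameter is left underdetermined.
primary_hessian = np.diag([1.0, 1.0, 0.0])
theta = np.array([1.0, -1.0, 0.5])
primary_grad = primary_hessian @ theta            # [1, -1, 0]

# A complementary auxiliary pushes only along the flat direction.
complementary_grad = np.array([0.0, 0.0, 2.0])
# A conflicting auxiliary pushes directly against the primary gradient.
conflicting_grad = np.array([-1.0, 1.0, 0.0])

for name, g in [("complementary", complementary_grad), ("conflicting", conflicting_grad)]:
    print(name,
          "flat-share:", round(underdetermined_fraction(g, primary_hessian), 3),
          "cosine vs primary:", round(cosine(g, primary_grad), 3))
```

In this toy setting the complementary auxiliary puts all of its gradient where the primary loss is silent, while the conflicting one puts none there and directly opposes the primary gradient. On a real model the Hessian is too large to decompose, so a practical version would need an approximation, and the resulting numbers would be only a heuristic input when choosing $\beta_k$.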
Why it matters
Practical world models must encode many factors of variation, and pure masked prediction can leave controllable, dynamics-relevant information underspecified. By clarifying when auxiliaries supply complementary signal and how to weight them, the paper offers a recipe for steering JEPA latents toward the information a world model needs for prediction and planning, rather than treating auxiliary objectives as an unprincipled add-on.
Strengths & limitations
- + Replaces trial-and-error with an alignment-based account of auxiliary benefit.
- + Offers actionable guidance on task choice and loss weighting.
- + Connects auxiliary design to the structure of the target representation.
- − "Alignment" can be hard to measure a priori for a novel task.
- − Optimal weights remain partly empirical and dataset-dependent.