How JEPA Avoids Noisy Features

At a glance

Problem: JEPAs avoid wasting capacity on noise, but why they reject unpredictable features had no formal explanation.

Key idea: Analyse a tractable deep linear self-distillation surrogate to expose the implicit bias that filters out noisy directions.

Modality: Theory (linear surrogate)

Target / masking: Asymmetric self-distillation: context encoder, EMA target encoder, predictor, latent prediction loss.

Builds on: Non-contrastive SSL dynamics and the slow-feature account of JEPA.

Used for: Justifying latent-space prediction as provably biased toward structured, predictable signal.

Motivation

Pixel-reconstruction models spend capacity modelling irreducible observation noise; JEPAs largely avoid this, but the mechanism had not been formalised. How JEPA Avoids Noisy Features (Littwin et al., 2024) supplies a theoretical account by studying a tractable surrogate that keeps JEPA's essential structure — an asymmetric self-distillation setup with a context encoder, an EMA (teacher) target encoder and a predictor, trained on a latent prediction loss. Restricting to the deep linear case makes the learning dynamics analysable while preserving the architectural ingredients believed to drive JEPA's behaviour.

How it works

Figure: canonical JEPA schematic for an input pair. The input is split into a visible context and hidden targets (token-level, blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy of $f_\theta$ with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between the predicted and actual target embeddings in latent space.
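The three components in the schematic can be written down in a few lines for the linear case. This is a minimal sketch, not the paper's code: the matrices `W`, `W_bar` and `W_p` stand in for $f_\theta$, $\bar f_\theta$ and $g_\phi$, and the dimensions, the helper names `latent_loss` and `ema_update`, and the EMA rate `tau` are all assumptions made for illustration. Context and target inputs are assumed to share one dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_lat = 4, 3  # hypothetical input and latent sizes

W = rng.normal(scale=0.1, size=(d_lat, d_in))  # context encoder f_theta (linear)
W_bar = W.copy()                               # target encoder, starts as a copy of W
W_p = np.eye(d_lat)                            # predictor g_phi (linear)

def latent_loss(W, W_p, W_bar, x_ctx, x_tgt):
    """Squared latent distance between the prediction and the target embedding."""
    pred = W_p @ (W @ x_ctx)   # predictor applied to the context embedding
    target = W_bar @ x_tgt     # target embedding, held constant for gradients
    return float(np.sum((pred - target) ** 2))

def ema_update(W_bar, W, tau=0.99):
    """Move the target encoder a small step toward the context encoder."""
    return tau * W_bar + (1.0 - tau) * W

x_ctx = rng.normal(size=d_in)  # stand-in for the visible context
x_tgt = rng.normal(size=d_in)  # stand-in for the hidden target
print(latent_loss(W, W_p, W_bar, x_ctx, x_tgt))
```

Only `W` and `W_p` would receive gradients from this loss; `W_bar` changes solely through `ema_update`, which is what makes the setup asymmetric.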

The deep linear self-distillation model strips away nonlinearities so the gradient flow can be solved or characterised in closed form, while retaining the predictor and the slowly updated EMA target. The authors track how the singular directions of the feature/correlation structure evolve under training. The interaction between the predictor and the EMA target determines which directions are amplified and which are suppressed: directions that are predictable from context — consistent between context and target — accumulate gradient signal and grow, while noisy, unpredictable directions receive vanishing effective gradient and are filtered out. The denoising is a property of the optimisation geometry, not of any explicit regulariser added to the loss.
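The amplify-versus-suppress behaviour described above can be reproduced in a two-dimensional toy. The sketch below is an illustration under strong assumptions, not the paper's model or experiments: the input has one direction shared exactly between context and target and one direction of independent unit-variance noise, gradient descent uses population covariances rather than samples, the encoder and predictor start balanced at $0.5\,I$, and the function name `train_linear_jepa`, the learning rate and the EMA rate are hypothetical choices.

```python
import numpy as np

def train_linear_jepa(steps=2000, lr=0.05, tau=0.9, noise_var=1.0):
    """Gradient descent on the expected latent loss of a toy linear JEPA.

    Input coordinates are (signal, noise). The signal is identical in
    context and target; the noise is drawn independently in each view.
    """
    sigma_cc = np.diag([1.0, noise_var])  # E[x_ctx x_ctx^T]
    sigma_tc = np.diag([1.0, 0.0])        # E[x_tgt x_ctx^T]; noise is uncorrelated

    W = 0.5 * np.eye(2)    # context encoder (assumed balanced initialisation)
    W_p = 0.5 * np.eye(2)  # predictor
    W_bar = W.copy()       # EMA target encoder

    for _ in range(steps):
        # Residual of the expected loss; W_bar is treated as a constant.
        residual = W_p @ W @ sigma_cc - W_bar @ sigma_tc
        grad_W = 2.0 * W_p.T @ residual
        grad_Wp = 2.0 * residual @ W.T
        W = W - lr * grad_W
        W_p = W_p - lr * grad_Wp
        # The target encoder slowly follows the context encoder.
        W_bar = tau * W_bar + (1.0 - tau) * W
    return W, W_p, W_bar

W, W_p, W_bar = train_linear_jepa()
print("encoder gain on shared direction:", W[0, 0])
print("encoder gain on noise direction: ", W[1, 1])
```

In this toy the encoder gain on the shared direction grows toward 1, while the gain on the noise direction only shrinks, because nothing in the target ever pushes it back up. No weight decay or other regulariser appears in the loop. The rate of decay depends on the assumed initialisation, noise variance and step size.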

The analysis

For a deep linear network the latent prediction objective

$$\mathcal{L} = \big\lVert W_p\, W\, x_{\text{ctx}} - \operatorname{sg}[\bar W\, x_{\text{tgt}}] \big\rVert^2$$

has dynamics governed by the covariance of the inputs and the coupling between the trained weights $W$ and the EMA copy $\bar W$. The analysis shows the effective gradient along a feature direction scales with how well that direction is shared between context and target; pure-noise directions, being uncorrelated across the pair, have an effective signal that decays to zero. The model therefore converges to a representation supported on the predictable subspace, with noise provably attenuated.
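Writing out the expected gradients makes the role of the covariances explicit. This is a sketch under assumptions not stated above: zero-mean inputs, the stop-gradient target treated as a constant, and notation introduced here for illustration, namely $\Sigma_{cc} = \mathbb{E}[x_{\text{ctx}} x_{\text{ctx}}^\top]$, $\Sigma_{tc} = \mathbb{E}[x_{\text{tgt}} x_{\text{ctx}}^\top]$ and an EMA rate $\tau$.

$$
\begin{aligned}
R &= W_p\, W\, \Sigma_{cc} - \bar W\, \Sigma_{tc}, \\
\nabla_{W}\, \mathbb{E}[\mathcal{L}] &= 2\, W_p^\top R, \\
\nabla_{W_p}\, \mathbb{E}[\mathcal{L}] &= 2\, R\, W^\top, \\
\bar W &\leftarrow \tau\, \bar W + (1 - \tau)\, W.
\end{aligned}
$$

The target enters only through the cross-covariance $\Sigma_{tc}$. In the simplest case, where $\Sigma_{cc}$ and $\Sigma_{tc}$ are diagonal in the same basis, a direction that is uncorrelated between context and target has a zero entry in $\Sigma_{tc}$, so the only term left along it is $W_p W \Sigma_{cc}$, which drives the predictor-encoder product in that direction toward zero.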

Why it matters

This complements the slow-feature picture: where slowness describes what JEPAs keep, this work explains the mechanistic how through predictor/EMA dynamics. It gives a principled justification for predicting in latent rather than pixel space — the architecture is provably biased toward structured, predictable signal, exactly the property a world model needs to forecast dynamics without being derailed by irreducible observation noise. It also clarifies the specific role of the predictor and EMA target in shaping the implicit bias.

Strengths & limitations

+ Turns an empirical observation into an analysable mechanism.
+ Isolates the predictor-EMA interaction as the source of denoising.
+ Reinforces the case for latent-space over pixel-space prediction.
− The deep linear surrogate omits nonlinearities, attention and the full masking process of real JEPAs.
− Conclusions about exact filtering rest on idealised noise and stationarity assumptions.

Connections & references

Builds on: SSL Dynamics, Slow Features
