Motivation
Pixel-reconstruction models spend capacity modelling irreducible observation noise; JEPAs largely avoid this, but the mechanism had not been formalised. "How JEPA Avoids Noisy Features" (Littwin et al., 2024) supplies a theoretical account by studying a tractable surrogate that keeps JEPA's essential structure: an asymmetric self-distillation setup with a context encoder, an EMA (teacher) target encoder and a predictor, trained on a latent prediction loss. Restricting to the deep linear case makes the learning dynamics analysable while preserving the architectural ingredients believed to drive JEPA's behaviour.
How it works
The deep linear self-distillation model strips away nonlinearities so the gradient flow can be solved or characterised in closed form, while retaining the predictor and the slowly updated EMA target. The authors track how the singular directions of the feature/correlation structure evolve under training. The interaction between the predictor and the EMA target determines which directions are amplified and which are suppressed: directions that are predictable from context — consistent between context and target — accumulate gradient signal and grow, while noisy, unpredictable directions receive vanishing effective gradient and are filtered out. The denoising is a property of the optimisation geometry, not of any explicit regulariser added to the loss.
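The filtering described above can be illustrated with a small NumPy sketch. It is a toy, not the paper's construction, and several choices are assumptions made here for illustration: the encoder is a single matrix `W` rather than a deep product, the first `k` input coordinates are shared between context and target while the rest are independent noise of equal variance, training uses exact population gradients, and the encoder starts at a random orthogonal matrix with an identity predictor. The function name `train_linear_jepa` and all hyperparameters are hypothetical.

```python
import numpy as np


def train_linear_jepa(steps=2000, lr=0.05, ema=0.99, noise_var=1.0, seed=0):
    """Toy linear self-distillation: predictor W_p, encoder W, EMA target W_bar.

    Assumed data model (illustrative only): x_ctx = [s, n_ctx], x_tgt = [s, n_tgt],
    where s is shared signal and n_ctx, n_tgt are independent noise.
    """
    rng = np.random.default_rng(seed)
    d, k = 4, 2  # input dimension; first k coordinates carry the shared signal

    # Population second moments implied by the assumed data model.
    cov_cc = np.diag([1.0] * k + [noise_var] * (d - k))  # E[x_ctx x_ctx^T]
    cov_tc = np.diag([1.0] * k + [0.0] * (d - k))        # E[x_tgt x_ctx^T]

    # Orthogonal encoder init, identity predictor, target starts as a copy.
    W, _ = np.linalg.qr(rng.standard_normal((d, d)))
    W_bar = W.copy()
    W_p = np.eye(d)

    def gains(W_p, W):
        # Frobenius norm of the end-to-end map on signal and noise inputs.
        M = W_p @ W
        return np.linalg.norm(M[:, :k]), np.linalg.norm(M[:, k:])

    before = gains(W_p, W)
    for _ in range(steps):
        # E[(prediction - target) x_ctx^T]; the target is treated as a constant
        # (stop-gradient), so no gradient is taken with respect to W_bar.
        resid = W_p @ W @ cov_cc - W_bar @ cov_tc
        grad_Wp = 2.0 * resid @ W.T
        grad_W = 2.0 * W_p.T @ resid
        W_p = W_p - lr * grad_Wp
        W = W - lr * grad_W
        # Slow EMA update of the target encoder.
        W_bar = ema * W_bar + (1.0 - ema) * W
    after = gains(W_p, W)
    return before, after


if __name__ == "__main__":
    (sig0, noise0), (sig1, noise1) = train_linear_jepa()
    print(f"signal gain: {sig0:.4f} -> {sig1:.4f}")
    print(f"noise gain:  {noise0:.4f} -> {noise1:.6f}")
```

With these settings the signal gain should stay at its initial value while the noise gain falls by more than two orders of magnitude, even though no regulariser appears in the loss and the noise has the same variance as the signal. The shared coordinates sit at a fixed point because the target term cancels the prediction term; the noise coordinates see only the shrinking term.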
The analysis
For a deep linear network the latent prediction objective
$$\mathcal{L} = \big\lVert W_p\, W\, x_{\text{ctx}} - \operatorname{sg}[\bar W\, x_{\text{tgt}}] \big\rVert^2$$
has dynamics governed by the covariance of the inputs and the coupling between the trained weights $W$ and the EMA copy $\bar W$. Here $W_p$ is the linear predictor and $\operatorname{sg}[\cdot]$ denotes stop-gradient, so no gradient flows through the target branch. The analysis shows the effective gradient along a feature direction scales with how well that direction is shared between context and target; pure-noise directions, being uncorrelated across the pair, have an effective signal that decays to zero. The model therefore converges to a representation supported on the predictable subspace, with noise provably attenuated.
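As a simplified worked step (a sketch under assumptions, not the paper's full derivation), treat $W$ as a single matrix and take the expectation of the objective over the data. The covariance symbols $\Sigma_{cc}$ and $\Sigma_{tc}$ are notation introduced here. Because the target is held fixed by the stop-gradient, the gradient with respect to the encoder is

$$\nabla_W\, \mathbb{E}[\mathcal{L}] = 2\, W_p^\top \big( W_p\, W\, \Sigma_{cc} - \bar W\, \Sigma_{tc} \big), \qquad \Sigma_{cc} = \mathbb{E}\big[x_{\text{ctx}} x_{\text{ctx}}^\top\big], \quad \Sigma_{tc} = \mathbb{E}\big[x_{\text{tgt}} x_{\text{ctx}}^\top\big].$$

Now suppose $v$ is a pure-noise input direction, meaning $\Sigma_{tc}\, v = 0$ (uncorrelated across the pair) and $\Sigma_{cc}\, v = \sigma^2 v$. The target term drops out, and under gradient flow

$$\frac{d}{dt}\,(W v) = -2\sigma^2\, W_p^\top W_p\, (W v), \qquad \frac{d}{dt}\,\lVert W v \rVert^2 = -4\sigma^2\, \lVert W_p\, W v \rVert^2 \le 0,$$

so the encoder's response to $v$ can only shrink. A shared direction keeps the $\bar W\, \Sigma_{tc}$ term, which pulls the prediction toward the target instead of toward zero.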
Why it matters
This complements the slow-feature picture: where slowness describes what JEPAs keep, this work explains how they keep it, through predictor/EMA dynamics. It gives a principled justification for predicting in latent rather than pixel space: the architecture is provably biased toward structured, predictable signal, exactly the property a world model needs to forecast dynamics without being derailed by irreducible observation noise. It also clarifies the specific role of the predictor and EMA target in shaping the implicit bias.
Strengths & limitations
- + Turns an empirical observation into an analysable mechanism.
- + Isolates the predictor-EMA interaction as the source of denoising.
- + Reinforces the case for latent-space over pixel-space prediction.
- − The deep linear surrogate omits nonlinearities, attention and the full masking process of real JEPAs.
- − Conclusions about exact filtering rest on idealised noise and stationarity assumptions.