Motivation
BYOL and similar methods avoid representation collapse without any negative pairs, which seems paradoxical: with nothing pushing embeddings apart, why do they not collapse to a constant? "Understanding Self-Supervised Learning Dynamics without Contrastive Pairs" (Tian et al., 2021) gives a foundational theoretical account of this behaviour and derives DirectPred from it. The setup mirrors the JEPA template (an online, or context, encoder; an EMA target encoder with stop-gradient; and a linear predictor on the online branch), so the analysis speaks directly to why JEPA's architecture works.
How it works
The authors analyse the coupled gradient dynamics of the predictor and the encoders. The central finding is that the predictor, together with the stop-gradient EMA target, is what prevents collapse: the predictor's weights evolve to align with the eigenstructure of the feature correlation matrix, and this interaction stabilises the representation away from the degenerate constant solution. Collapse avoidance therefore emerges from the optimisation dynamics, not from any explicit repulsion term. The slowly moving EMA target and the predictor's adaptive alignment create the asymmetry between the two branches that keeps embeddings full-rank.
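The components named above can be laid out in a toy linear model. The following is a minimal sketch, not the authors' code: the dimensions, learning rate, EMA momentum `tau`, noise-based "augmentations" and the names `W`, `W_a`, `W_p` and `step` are all assumptions chosen for illustration, and regularisation such as weight decay is omitted. It only shows where the stop-gradient, the predictor and the EMA update sit in one training step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat = 8, 4      # hypothetical toy dimensions
lr, tau = 0.05, 0.99     # assumed learning rate and EMA momentum

W = rng.normal(scale=0.1, size=(d_feat, d_in))      # online (context) encoder
W_a = W.copy()                                      # EMA target encoder
W_p = rng.normal(scale=0.1, size=(d_feat, d_feat))  # linear predictor


def step(W, W_a, W_p, x1, x2):
    """One gradient step on 0.5 * mean ||W_p W x1 - stopgrad(W_a x2)||^2."""
    n = x1.shape[0]
    f1 = x1 @ W.T              # online features, shape (n, d_feat)
    target = x2 @ W_a.T        # target features, treated as a constant
    err = f1 @ W_p.T - target  # prediction error, shape (n, d_feat)

    # Gradients flow only through the online branch (stop-gradient on target).
    grad_Wp = err.T @ f1 / n
    grad_W = W_p.T @ err.T @ x1 / n

    W_new = W - lr * grad_W
    W_p_new = W_p - lr * grad_Wp
    # The target encoder is never trained directly; it tracks the online one.
    W_a_new = tau * W_a + (1.0 - tau) * W_new

    loss = 0.5 * np.mean(np.sum(err**2, axis=1))
    return W_new, W_a_new, W_p_new, loss


# Two noisy views of the same inputs stand in for data augmentation.
x = rng.normal(size=(64, d_in))
for _ in range(100):
    x1 = x + 0.1 * rng.normal(size=x.shape)
    x2 = x + 0.1 * rng.normal(size=x.shape)
    W, W_a, W_p, loss = step(W, W_a, W_p, x1, x2)
```

Note that the loss contains no term that pushes different samples apart; any resistance to collapse has to come from how `W`, `W_p` and `W_a` evolve together, which is the object of the analysis.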
The analysis
Tracking the predictor weight $W_p$ and the feature correlation matrix $F$, the dynamics show the eigenvectors of $W_p$ converging toward alignment with those of $F$. This motivates DirectPred, which sets the predictor analytically from the eigendecomposition of $F$ rather than learning it by gradient descent:
$$W_p \;=\; U\,\operatorname{diag}\!\big(\sqrt{\lambda_i} + \epsilon\big)\,U^\top, \qquad F = U\,\operatorname{diag}(\lambda_i)\,U^\top.$$
Setting the predictor in this closed form performs comparably to, or better than, a predictor trained by gradient descent, which supports the dynamical account of why non-contrastive learning stays full-rank.
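The closed-form rule can be written in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the reference implementation: the function name `direct_pred`, the value of `eps`, the synthetic features and the single-batch estimate of $F$ are all hypothetical, and how $F$ is accumulated during real training is not covered here.

```python
import numpy as np


def direct_pred(F, eps=0.1):
    """Return W_p = U diag(sqrt(lambda_i) + eps) U^T for a symmetric PSD F."""
    F = 0.5 * (F + F.T)             # symmetrise against round-off
    lam, U = np.linalg.eigh(F)      # F = U diag(lam) U^T
    lam = np.clip(lam, 0.0, None)   # clamp tiny negative eigenvalues
    p = np.sqrt(lam) + eps          # predictor spectrum
    return (U * p) @ U.T            # scales column i of U by p[i]


# Hypothetical online-branch features: 256 samples, 4 dimensions.
rng = np.random.default_rng(0)
f = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 4))
F = f.T @ f / f.shape[0]            # empirical correlation matrix E[f f^T]

W_p = direct_pred(F)
```

By construction `W_p` is symmetric and shares its eigenvectors with `F`, so the two matrices commute. That is the aligned state the gradient dynamics are described as converging toward; DirectPred imposes it directly instead of waiting for training to reach it.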
Why it matters
This analysis is the theoretical bedrock for the JEPA family's reliance on EMA target encoders, predictors and stop-gradients. It explains why these choices are safe and effective: latent-prediction world models can learn non-trivial, full-rank state representations from prediction alone, because the predictor, the stop-gradient and the EMA target jointly steer the optimisation away from the collapsed solution. Later theory and analyses of JEPA build on this account of collapse avoidance through dynamics.
Strengths & limitations
- + Resolves the apparent paradox of negative-free SSL with concrete dynamics.
- + Yields DirectPred, a closed-form predictor that validates the theory.
- + Foundational for the JEPA EMA/predictor/stop-gradient recipe.
- − Core analysis uses a linear predictor and idealised assumptions.
- − Explains stability of the recipe but not how to remove it entirely (cf. heuristics-free LeJEPA).