Understanding SSL Dynamics without Contrastive Pairs

At a glance

Problem: Methods like BYOL avoid collapse with no negative pairs, which seemed paradoxical and unexplained.

Key idea: Analyse the coupled predictor/EMA gradient dynamics; show they prevent collapse, then derive DirectPred from the theory.

Modality: Theory (non-contrastive SSL)

Target / masking: Online encoder, EMA target encoder with stop-gradient, and a linear predictor on the online branch.

Builds on: BYOL-style self-distillation without negatives.

Used for: The theoretical bedrock for JEPA's reliance on EMA target encoders, predictors and stop-gradients.

Motivation

BYOL and similar methods avoid representation collapse without any negative pairs, which seemed paradoxical: with nothing pushing embeddings apart, why do they not collapse to a constant? "Understanding Self-Supervised Learning Dynamics without Contrastive Pairs" (Tian et al., 2021) gives a foundational theoretical account and derives DirectPred from it. The setup mirrors the JEPA template: an online (context) encoder, an EMA target encoder with stop-gradient, and a linear predictor on the online branch. The analysis therefore speaks directly to why JEPA's architecture works.

How it works

Figure: Canonical JEPA schematic for a view pair. The input is split into a visible context and hidden targets (token-level, blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimises the latent distance.

The authors analyse the coupled gradient dynamics of the predictor and the encoders. The central finding is that the predictor together with the EMA target is what prevents collapse: the predictor's weights evolve to align with the eigenstructure of the feature correlation matrix, and this interaction stabilises the representation away from the degenerate constant solution. Collapse avoidance therefore emerges from the optimisation dynamics, not from any explicit repulsion term. The EMA's slow target and the predictor's adaptive alignment create the asymmetry that keeps embeddings full-rank.
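To make the three ingredients concrete, here is a minimal NumPy sketch of one training step for a fully linear toy model. It is an illustration under stated assumptions, not the paper's implementation: the names `loss_fn` and `train_step`, the squared-error loss, and the hyperparameters `lr` and `tau` are all hypothetical choices. The stop-gradient appears as the absence of any gradient with respect to the target weights `W_a`, which change only through the EMA update.

```python
import numpy as np

def loss_fn(W, W_p, W_a, x1, x2):
    """Half the mean squared latent distance between prediction and target."""
    pred = x1 @ W.T @ W_p.T      # online branch: encoder W, then predictor W_p
    target = x2 @ W_a.T          # target branch: EMA encoder W_a (a constant here)
    return 0.5 * np.mean(np.sum((pred - target) ** 2, axis=1))

def train_step(W, W_p, W_a, x1, x2, lr=0.1, tau=0.99):
    """One gradient step on (W, W_p) followed by an EMA update of W_a.

    Assumed toy setting: x1 and x2 are two views of the same batch, each of
    shape (n, d_in); W and W_a have shape (k, d_in); W_p has shape (k, k).
    """
    n = x1.shape[0]
    z = x1 @ W.T                         # online features, shape (n, k)
    resid = z @ W_p.T - x2 @ W_a.T       # prediction error, shape (n, k)

    # Hand-derived gradients of loss_fn. Nothing flows into W_a:
    # that omission is the stop-gradient.
    grad_W_p = resid.T @ z / n
    grad_W = W_p.T @ resid.T @ x1 / n

    W_new = W - lr * grad_W
    W_p_new = W_p - lr * grad_W_p
    # The target slowly tracks the online encoder.
    W_a_new = tau * W_a + (1.0 - tau) * W_new
    return W_new, W_p_new, W_a_new
```

The asymmetry discussed above is visible in the update rules: `W` and `W_p` follow the gradient, whereas `W_a` only follows `W` at a rate set by `tau`.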

The analysis

Tracking the predictor weight $W_p$ and the feature correlation matrix $F$, the dynamics show the eigenvectors of $W_p$ gradually aligning with those of $F$. This motivates DirectPred, which sets the predictor in closed form from the eigendecomposition of $F$ rather than learning it by gradient descent:

$$W_p \;=\; U\,\operatorname{diag}\!\big(\sqrt{\lambda_i} + \epsilon\big)\,U^\top, \qquad F = U\,\operatorname{diag}(\lambda_i)\,U^\top.$$

Setting the predictor in this closed form matches or improves on the performance of a predictor trained by gradient descent, which supports the dynamical account of why non-contrastive learning stays full-rank.
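The closed-form rule can be sketched in a few lines of NumPy. This is a minimal illustration of the formula as written above, not the authors' code: the function names `direct_pred` and `update_corr`, the default `eps`, and the moving-average rate `rho` are assumptions made for the example. In practice $F$ has to be estimated from batches of online features; the running average shown here is one plausible way to do that.

```python
import numpy as np

def update_corr(F_hat, feats, rho=0.3):
    """Running estimate of the feature correlation matrix F = E[f f^T].

    feats has shape (n, k); rho is an assumed moving-average rate.
    """
    batch_corr = feats.T @ feats / feats.shape[0]
    return rho * F_hat + (1.0 - rho) * batch_corr

def direct_pred(F, eps=0.1):
    """Set W_p = U diag(sqrt(lambda_i) + eps) U^T from F = U diag(lambda_i) U^T."""
    F = 0.5 * (F + F.T)                  # symmetrise against round-off
    lam, U = np.linalg.eigh(F)           # eigenvalues ascending, eigenvectors in columns
    lam = np.clip(lam, 0.0, None)        # guard against tiny negative eigenvalues
    return (U * (np.sqrt(lam) + eps)) @ U.T   # scales column i of U by its new eigenvalue
```

For example, with $F = \operatorname{diag}(4, 1)$ and $\epsilon = 0.1$ the rule gives $W_p = \operatorname{diag}(2.1, 1.1)$. By construction $W_p$ is symmetric and shares its eigenvectors with $F$, so the two matrices commute; with $\epsilon = 0$ it reduces to the matrix square root of $F$.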

Why it matters

This analysis is the theoretical bedrock for the JEPA family's reliance on EMA target encoders, predictors and stop-gradients. It explains why these choices are safe and effective: latent-prediction world models can learn non-trivial, full-rank state representations from prediction alone, with each component playing a precise role in keeping the learned dynamics meaningful. Later theory and analyses of JEPA build on this account of collapse avoidance through dynamics.

Strengths & limitations

+ Resolves the apparent paradox of negative-free SSL with concrete dynamics.
+ Yields DirectPred, a closed-form predictor that validates the theory.
+ Foundational for the JEPA EMA/predictor/stop-gradient recipe.
− Core analysis uses a linear predictor and idealised assumptions.
− Explains stability of the recipe but not how to remove it entirely (cf. heuristics-free LeJEPA).

Connections & references

Paper ↗