At a glance
Problem: JEPAs avoid representational collapse with a stack of heuristics (stop-gradients, teacher-student EMA, output normalisation, learning-rate schedules) that are fragile and hard to tune.
Key idea: Prove the optimal embedding distribution is an isotropic Gaussian, then enforce it directly with a single regulariser, SIGReg.
Modality: Domain-agnostic (theory + recipe)
Target / masking: Standard JEPA prediction, but with no EMA teacher and no stop-gradient; the two branches share weights.
Builds on: Variance-covariance regularisation (VICReg) and the analysis of non-contrastive collapse.
Used for: Stable, scalable JEPA pretraining across architectures and domains.

Motivation

Joint-embedding methods can collapse to trivial constant outputs. The field's fixes — momentum teachers, stop-gradients, centering, sharpening, carefully scheduled learning rates — work empirically but are heuristic, interact unpredictably, and demand per-setting tuning. LeJEPA asks a cleaner question: what distribution should the embeddings follow, and can we regularise toward it with one principled term instead of a pile of tricks?

How it works

[Figure: JEPA schematic. Input (any modality; tokens or blocks) → context encoder $f_\theta$ and target encoder (shared weights, SIGReg) → predictor $g_\phi$ → latent loss between the prediction and the target embedding.]
Canonical JEPA schematic for any modality. The input is split into a visible context and hidden targets (token-level or blocks). The context encoder $f_\theta$ embeds what is visible and the target encoder embeds the targets. In a canonical JEPA the target encoder $\bar f_\theta$ is an EMA copy with gradients stopped; in LeJEPA it is the same encoder $f_\theta$, with shared weights and no stop-gradient. The predictor $g_\phi$ maps the context embedding to the target embeddings, and training minimises the latent distance.

LeJEPA starts from the downstream objective. It shows that, to minimise prediction risk across unknown downstream tasks, the embedding distribution should be an isotropic Gaussian — maximally spread, with no privileged directions. It then introduces SIGReg (Sketched Isotropic Gaussian Regularization), which pushes the empirical embedding distribution toward that ideal using random projections (sketches): along many random 1-D directions, the projected embeddings are tested against a standard normal, and deviations are penalised.
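The sketching step can be illustrated with a minimal NumPy sketch. It assumes one particular univariate test, a characteristic-function comparison in the style of Epps-Pulley; the text above only says the projections are "tested against a standard normal", so that choice, the function name `sigreg_sketch`, and all of its parameters are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sigreg_sketch(z, num_directions=64, num_points=17, t_max=3.0, seed=0):
    """Hypothetical SIGReg-style penalty: mean discrepancy between the
    empirical characteristic function of random 1-D projections and the
    characteristic function of N(0, 1)."""
    rng = np.random.default_rng(seed)
    n, d = z.shape
    # Random unit directions, one per column.
    a = rng.standard_normal((d, num_directions))
    a /= np.linalg.norm(a, axis=0, keepdims=True)
    proj = z @ a                                  # (n, num_directions)
    t = np.linspace(-t_max, t_max, num_points)    # evaluation grid
    phase = proj[:, :, None] * t[None, None, :]   # (n, M, T)
    ecf_real = np.cos(phase).mean(axis=0)         # real part of empirical CF
    ecf_imag = np.sin(phase).mean(axis=0)         # imaginary part
    target = np.exp(-0.5 * t**2)                  # CF of N(0, 1), purely real
    sq_err = (ecf_real - target) ** 2 + ecf_imag**2
    weight = np.exp(-0.5 * t**2)                  # assumed Gaussian weighting
    dt = t[1] - t[0]
    per_direction = (sq_err * weight).sum(axis=1) * dt  # Riemann sum over t
    return float(per_direction.mean())

rng = np.random.default_rng(1)
gauss = rng.standard_normal((2000, 16))           # already isotropic Gaussian
collapsed = np.zeros((2000, 16))                  # fully collapsed embeddings
aniso = gauss * np.array([5.0] + [1.0] * 15)      # one stretched axis

score_gauss = sigreg_sketch(gauss)
score_collapsed = sigreg_sketch(collapsed)
score_aniso = sigreg_sketch(aniso)
print(score_gauss, score_collapsed, score_aniso)
```

Under these assumptions the penalty is close to zero for isotropic Gaussian embeddings and clearly larger for collapsed or anisotropic ones. A training version would need a differentiable framework; NumPy is used here only to show the statistic.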

Crucially, this removes the usual machinery: no stop-gradient, no teacher-student/EMA, no schedulers. The JEPA prediction loss provides the learning signal; SIGReg alone prevents collapse. The implementation is about fifty lines with a single trade-off hyperparameter.

The objective

The total loss combines latent prediction with the Gaussian regulariser:

$$\mathcal{L} = \underbrace{\lVert g_\phi(z_{\text{ctx}}) - z_{\text{tgt}}\rVert^2}_{\text{prediction}} \;+\; \lambda\,\underbrace{\operatorname{SIGReg}(Z)}_{\text{embeddings}\,\to\,\mathcal{N}(0,I)}$$

SIGReg estimates the discrepancy between the distribution of sketched embeddings and a standard normal; minimising it drives $Z$ toward isotropy. Both branches share parameters, so gradients flow through the whole network.
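One way to spell out the regulariser, consistent with the description above but using notation introduced here for illustration ($M$ random unit directions $a_m$, a batch of $N$ embeddings $z_i \in \mathbb{R}^d$, and a generic univariate normality statistic $T$), is as an average over sketches:

$$\operatorname{SIGReg}(Z) \;\approx\; \frac{1}{M}\sum_{m=1}^{M} T\!\left(\{a_m^\top z_i\}_{i=1}^{N}\right), \qquad \lVert a_m\rVert = 1$$

The reduction to one dimension is justified by the Cramér-Wold theorem: a distribution on $\mathbb{R}^d$ is $\mathcal{N}(0, I)$ exactly when every projection onto a unit direction is $\mathcal{N}(0, 1)$. The specific form of $T$ is left open in this sketch.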

Key results & what's novel

LeJEPA turns an empirical recipe into a principled one: a single objective with one hyperparameter, linear time and memory, and stability across architectures (ResNets, ViTs, ConvNets) and domains without bespoke tuning. The theoretical contribution — identifying the isotropic Gaussian as optimal and giving a scalable estimator for it — also underpins the later analysis of when a JEPA recovers a faithful world model.

Strengths & limitations

  • + Heuristics-free: no EMA, stop-gradient, or schedulers to tune.
  • + Theoretically grounded, simple to implement, linear cost, distributed-friendly.
  • − The optimality of the isotropic Gaussian rests on assumptions about the downstream-task family; whether it is the best target for every domain (e.g. heavy-tailed biological data) is an empirical question.
  • − SIGReg adds a sketching estimator whose variance depends on the number of random projections.
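The last limitation can be made concrete with a small experiment. The sketch below holds a batch fixed, redraws the random directions many times, and measures how much the sketched penalty fluctuates. To stay short it uses a simple moment-matching stand-in (squared mean plus squared deviation of the variance from 1) rather than any test named in the source; `sketched_penalty`, `spread`, and the data are hypothetical.

```python
import numpy as np

def sketched_penalty(z, num_directions, rng):
    """Hypothetical stand-in penalty averaged over random unit directions."""
    d = z.shape[1]
    a = rng.standard_normal((d, num_directions))
    a /= np.linalg.norm(a, axis=0, keepdims=True)
    proj = z @ a
    mean = proj.mean(axis=0)
    var = proj.var(axis=0)
    # Zero only when each projection has mean 0 and variance 1.
    return float(np.mean(mean**2 + (var - 1.0) ** 2))

rng = np.random.default_rng(0)
# Fixed anisotropic batch, so different directions give different values.
z = rng.standard_normal((1000, 32)) * np.linspace(0.5, 2.0, 32)

def spread(num_directions, trials=200):
    """Standard deviation of the penalty across redraws of the directions."""
    values = [sketched_penalty(z, num_directions, rng) for _ in range(trials)]
    return float(np.std(values))

spread_few = spread(4)
spread_many = spread(256)
print(spread_few, spread_many)
```

Because the penalty is an average over independently drawn directions, its spread is expected to shrink roughly with the square root of the number of directions, which is the trade-off between compute and estimator noise.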

Connections & references