Connecting JEPA with Contrastive SSL

At a glance

Problem: JEPA is described as non-contrastive, yet it avoids collapse without explicit negatives, leaving its relation to contrastive learning unclear.

Key idea: Show that the JEPA prediction objective can be analysed in the same alignment-versus-uniformity framework as contrastive SSL.

Modality: Theory (representation learning)

Target / masking: Context-to-target latent prediction with an EMA target encoder, a predictor and stop-gradient.

Builds on: Contrastive learning (InfoNCE) and non-contrastive self-distillation dynamics.

Used for: Transferring contrastive theory and guarantees to predictive JEPA world models.

Motivation

JEPA is usually labelled non-contrastive: it predicts target embeddings from context without explicit negative pairs, relying on architectural asymmetry (an EMA target encoder, a predictor and stop-gradient) to avoid the degenerate constant solution. Contrastive methods such as InfoNCE instead pull positives together and push negatives apart. "Connecting JEPA with Contrastive Self-Supervised Learning" (Mo et al., 2024) bridges this apparent divide by asking whether the two families do fundamentally different things or are two views of the same principle.

How it works

Figure: canonical JEPA schematic for a view pair. The input is split into a visible context and hidden targets (token-level, blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between predicted and actual target embeddings in latent space.
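
The schematic can be made concrete with a minimal NumPy sketch. Everything here is an illustrative assumption rather than the paper's setup: the encoders and predictor are single linear maps, the names `jepa_loss`, `ema_update`, `W_ctx`, `W_tgt` and `W_pred` are hypothetical, and no gradients are computed, so stop-gradient appears only as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def ema_update(target_params, online_params, momentum=0.99):
    """Move the target weights a small step toward the online weights."""
    return momentum * target_params + (1.0 - momentum) * online_params

def jepa_loss(x_context, x_target, W_ctx, W_tgt, W_pred):
    """Latent prediction loss for one batch, using linear toy modules."""
    z_ctx = x_context @ W_ctx   # context encoder f_theta
    z_tgt = x_target @ W_tgt    # target encoder (EMA copy); a constant under stop-gradient
    z_hat = z_ctx @ W_pred      # predictor g_phi
    return np.mean(np.sum((z_hat - z_tgt) ** 2, axis=1))

# Toy batch: 8 samples, 16 input dims per view, 4 latent dims (all assumed sizes).
x_context = rng.normal(size=(8, 16))
x_target = rng.normal(size=(8, 16))
W_ctx = 0.1 * rng.normal(size=(16, 4))
W_tgt = W_ctx.copy()            # target encoder starts as a copy of the context encoder
W_pred = np.eye(4)

loss = jepa_loss(x_context, x_target, W_ctx, W_tgt, W_pred)

# Stand-in for an optimiser step on the context encoder, followed by the EMA refresh.
W_ctx_new = W_ctx + 0.01 * rng.normal(size=W_ctx.shape)
W_tgt_new = ema_update(W_tgt, W_ctx_new, momentum=0.99)
```

With `momentum=0.99` the target encoder moves only 1% of the way toward the updated context encoder per step, which is the slow-moving asymmetry the schematic relies on.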

The work analyses the JEPA prediction objective within the contrastive framework's alignment-versus-uniformity decomposition. Latent prediction supplies the alignment (attraction) term: it pulls the context embedding toward the target embedding of a compatible pair. The collapse-avoidance machinery — the EMA target encoder and the predictor asymmetry — is shown to supply an implicit uniformity (repulsion) pressure that prevents all embeddings from collapsing to a point, playing a role analogous to the negative samples and repulsion term of InfoNCE. The correspondence unifies the JEPA, BYOL-style self-distillation and contrastive families under a single lens.
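
To see why alignment alone is not enough, the two quantities can be measured directly. The sketch below borrows the alignment and uniformity metrics of Wang and Isola (2020), which is an assumption on our part since the summary above does not fix a specific formula; the function names and the temperature `t=2.0` are likewise illustrative.

```python
import numpy as np

def l2_normalize(z):
    """Project each embedding onto the unit hypersphere."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment(z, z_pos):
    """Mean squared distance between positive pairs (lower = better aligned)."""
    return np.mean(np.sum((z - z_pos) ** 2, axis=1))

def uniformity(z, t=2.0):
    """Log mean Gaussian potential over distinct pairs (lower = more spread out)."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)   # each unordered pair once, no self-pairs
    return np.log(np.mean(np.exp(-t * sq_dists[upper])))

rng = np.random.default_rng(0)
spread = l2_normalize(rng.normal(size=(256, 8)))   # embeddings scattered over the sphere
collapsed = l2_normalize(np.ones((256, 8)))        # every input mapped to the same point

align_collapsed = alignment(collapsed, collapsed)  # perfect alignment
unif_collapsed = uniformity(collapsed)             # worst possible uniformity
unif_spread = uniformity(spread)
```

A collapsed encoder scores perfectly on alignment while its uniformity value sits at the maximum of zero; the spread embeddings score strictly lower. This is the gap that the implicit repulsion attributed to the EMA target and predictor has to close.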

The analysis

Writing the contrastive objective schematically as an attraction term minus a repulsion term,

$$\mathcal{L}_{\text{contrastive}} = \underbrace{\lVert z - z^{+} \rVert^2}_{\text{alignment}} \;-\; \lambda\,\underbrace{\mathbb{E}\big[\,\text{repulsion}(z, z^{-})\,\big]}_{\text{uniformity}},$$

the analysis identifies the JEPA latent-prediction loss with the alignment term and argues that the predictor/EMA dynamics induce an effective uniformity pressure even though no $z^{-}$ negatives are sampled. Making this map precise lets the authors carry over contrastive guarantees about what is and is not collapsed to the non-contrastive JEPA setting.
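
On the contrastive side, the split into attraction and repulsion can be checked numerically for InfoNCE. The sketch below is an illustration under stated assumptions (cosine similarity, in-batch negatives, temperature 0.1, hypothetical function names), not the paper's derivation. Note the sign convention: the log-sum-exp term enters with a plus sign as a penalty on similarity to other samples, so it corresponds to the negated repulsion term of the schematic objective, and it also includes the positive pair itself.

```python
import numpy as np

def _normalised_logits(z, z_pos, temperature):
    """Cosine similarities between every z_i and every z_pos_j, scaled by 1/temperature."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    return (z @ z_pos.T) / temperature

def _logsumexp_rows(logits):
    """Numerically stable log-sum-exp along each row."""
    row_max = logits.max(axis=1, keepdims=True)
    return row_max[:, 0] + np.log(np.sum(np.exp(logits - row_max), axis=1))

def info_nce_terms(z, z_pos, temperature=0.1):
    """Split InfoNCE into an attraction term and a crowding (repulsion) penalty."""
    logits = _normalised_logits(z, z_pos, temperature)
    attraction = -np.mean(np.diag(logits))        # rewards similarity of positive pairs
    crowding = np.mean(_logsumexp_rows(logits))   # penalises similarity to all candidates
    return attraction, crowding

def info_nce_direct(z, z_pos, temperature=0.1):
    """Cross-entropy form, with the positive for row i in column i."""
    logits = _normalised_logits(z, z_pos, temperature)
    log_probs = logits - _logsumexp_rows(logits)[:, None]
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))
z_pos = z + 0.1 * rng.normal(size=(32, 8))        # positives are lightly perturbed copies

attraction, crowding = info_nce_terms(z, z_pos)
total = info_nce_direct(z, z_pos)                 # equals attraction + crowding
```

Dropping the `crowding` term leaves only the attraction part, which is the piece the analysis identifies with JEPA's latent-prediction loss.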

Why it matters

The significance is conceptual consolidation. By placing JEPA, self-distillation and contrastive learning on a common footing, the paper makes it easier to reason about the representation quality of predictive latent objectives, to import the well-developed theory of contrastive learning into JEPA world models, and to design hybrid objectives that combine explicit negatives or uniformity terms with latent prediction. It clarifies that the predictive objective at the heart of JEPA inherits familiar alignment-uniformity trade-offs.

Strengths & limitations

+ Unifies three SSL families under one alignment-uniformity view.
+ Lets contrastive theory and intuitions transfer to JEPA.
+ Clarifies the precise role of EMA/predictor as implicit repulsion.
− The implicit uniformity argument is weaker than an explicit negative-sampling guarantee.
− The equivalence relies on modelling assumptions that may not hold exactly for deep nonlinear encoders.

Connections & references

Builds on: SSL Dynamics, VICReg

Related: Avoiding Noise, I-JEPA, LeJEPA
