VICReg — World Modeling

At a glance

Problem: Joint-embedding methods can collapse to constant or low-rank outputs; many rely on negatives or fragile EMA/stop-gradient tricks to prevent it.

Key idea: An explicit three-term regulariser (variance, invariance, covariance) that prevents collapse per-branch without negatives or momentum encoders.

Modality: Images (joint-embedding SSL)

Target / masking: Two augmented views encoded into embeddings; no masking, no negatives, no predictor required.

Builds on: Siamese / joint-embedding self-supervised learning.

Used for: A reusable anti-collapse component inside JEPA and dynamics-model objectives.

Motivation

Siamese joint-embedding methods risk collapsing to trivial constant outputs. The common fixes are negative pairs (contrastive learning), which are expensive, or architectural asymmetry (stop-gradient and momentum encoders), which is hard to tune. VICReg (Bardes et al., 2021) takes a different route: prevent collapse with an explicit, interpretable regulariser applied to the embedding statistics, so the method needs no negatives, no momentum encoder and no predictor. Its three-term loss has since become a standard anti-collapse component reused by several JEPA variants and dynamics models.

How it works

Given two augmented views encoded into embedding batches $Z$ and $Z'$, VICReg combines three terms. Invariance pulls matched embeddings together. Variance keeps each embedding dimension's standard deviation above a threshold, so no dimension can shrink to a constant. Covariance penalises the off-diagonal entries of the embedding covariance matrix, decorrelating dimensions so that they do not carry redundant copies of the same information and the representation does not become low-rank. Crucially, the variance and covariance terms act on each branch independently, which is why no negatives or momentum encoder are needed, though VICReg can still be combined with them.
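The per-branch statistics that the variance and covariance terms act on can be inspected directly. The following is a minimal NumPy sketch, not the authors' code: the helper name `embedding_stats` and the three toy batches are illustrative assumptions, chosen to contrast a healthy batch with a constant one and a rank-one one.

```python
import numpy as np

def embedding_stats(z):
    """Per-dimension std and off-diagonal covariance energy of one branch.

    z: array of shape (N, d), a batch of N embeddings (hypothetical helper).
    """
    std = z.std(axis=0, ddof=1)               # one value per embedding dimension
    cov = np.cov(z, rowvar=False)             # (d, d) sample covariance
    off_diag = cov - np.diag(np.diag(cov))    # keep only the off-diagonal entries
    return std, float((off_diag ** 2).sum())

rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 4))                          # spread out, roughly decorrelated
constant = np.ones((512, 4))                                 # full collapse: identical outputs
rank_one = np.repeat(rng.normal(size=(512, 1)), 4, axis=1)   # every dimension copies one signal

for name, z in [("healthy", healthy), ("constant", constant), ("rank_one", rank_one)]:
    std, cov_energy = embedding_stats(z)
    print(f"{name:9s} min std = {std.min():.3f}  off-diag energy = {cov_energy:.3f}")
```

In this toy setting the constant batch is flagged by its zero standard deviation, while the rank-one batch has healthy per-dimension spread and is only flagged by its off-diagonal covariance, which is why the two regularisers are complementary.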

The objective

The loss sums the three terms over a batch:

$$\mathcal{L} = \lambda\,\underbrace{\tfrac{1}{N}\textstyle\sum_i \lVert z_i - z'_i \rVert^2}_{\text{invariance}} \;+\; \mu\,\underbrace{\textstyle\sum_{A \in \{Z, Z'\}} \sum_j \max(0,\,\gamma - \operatorname{std}(a_{\cdot,j}))}_{\text{variance}} \;+\; \nu\,\underbrace{\textstyle\sum_{A \in \{Z, Z'\}} \sum_{j\neq k} [\operatorname{Cov}(A)]_{jk}^2}_{\text{covariance}}$$

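The variance hinge forces each dimension to remain active; the covariance penalty removes redundancy between dimensions; the invariance term aligns the views. Together they push the embeddings toward full-rank, decorrelated statistics without explicit negatives. Because all three are soft penalties, this outcome is encouraged by the loss and depends on how the weights are balanced; it is not a hard guarantee.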
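The objective can be sketched in a few lines of NumPy. This is an illustrative forward computation under stated assumptions, not the reference implementation: the function name `vicreg_loss` is hypothetical, the default weights are placeholders and not tuned values, `eps` is an assumed stabiliser inside the square root, and the variance and covariance terms are summed over both branches, following the per-branch description.

```python
import numpy as np

def vicreg_loss(z, z_prime, lam=1.0, mu=1.0, nu=1.0, gamma=1.0, eps=1e-4):
    """Three-term loss on two embedding batches of shape (N, d).

    lam, mu, nu are placeholder weights; eps is an assumed numerical stabiliser.
    """
    # Invariance: mean squared distance between matched embeddings.
    invariance = np.mean(np.sum((z - z_prime) ** 2, axis=1))

    variance = 0.0
    covariance = 0.0
    for branch in (z, z_prime):                    # regularise each branch on its own
        cov = np.cov(branch, rowvar=False)         # (d, d) sample covariance
        std = np.sqrt(np.diag(cov) + eps)          # per-dimension standard deviation
        variance += np.sum(np.maximum(0.0, gamma - std))  # hinge: penalise std below gamma
        off_diag = cov - np.diag(np.diag(cov))
        covariance += np.sum(off_diag ** 2)        # squared off-diagonal entries

    total = lam * invariance + mu * variance + nu * covariance
    terms = {"invariance": invariance, "variance": variance, "covariance": covariance}
    return total, terms

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
z_prime = z + 0.1 * rng.normal(size=(256, 8))      # a slightly perturbed second view
collapsed = np.zeros((256, 8))                     # both branches output a constant

_, good = vicreg_loss(z, z_prime)
_, bad = vicreg_loss(collapsed, collapsed)
print(good)
print(bad)
```

The collapsed pair has zero invariance loss, so alignment alone would accept it; the variance hinge is what makes that solution costly. In training, the same computation would be written in an automatic-differentiation framework so that gradients flow through both branches.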

Key results & what's novel

VICReg is reported to perform on par with contrastive and self-distillation methods on standard downstream benchmarks while giving transparent, tunable control over embedding statistics, and it removes the need for negatives, momentum encoders and predictors. This directly inspired the regularised formulations of JEPAs: when a JEPA predicts in latent space, variance/covariance regularisation ensures the target embeddings carry full-rank, decorrelated information, avoiding the trivial constant-prediction solution. The variance and covariance terms can also be read as enforcing the first two moments of a Gaussian embedding distribution.
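The Gaussian reading can be made concrete with a short worked note. It is stated for the loss as written in the objective and is not taken from the paper. Write $C = \operatorname{Cov}(Z)$ for one branch, so that $\operatorname{std}(z_{\cdot,j}) = \sqrt{C_{jj}}$. The two regularisers then depend on $Z$ only through $C$:

$$\textstyle\sum_j \max(0,\,\gamma - \sqrt{C_{jj}}) = 0 \iff C_{jj} \ge \gamma^2 \;\;\forall j, \qquad \textstyle\sum_{j\neq k} C_{jk}^2 = 0 \iff C_{jk} = 0 \;\;\forall j \neq k$$

Both terms vanish exactly when $C$ is diagonal with entries of at least $\gamma^2$. An isotropic Gaussian with covariance $\gamma^2 I$ satisfies this at the population level, but so does any other distribution with the same second-order statistics, and the hinge is one-sided, so variances above $\gamma^2$ are not penalised.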

Strengths & limitations

+ Explicit, interpretable anti-collapse with no negatives or momentum encoder.
+ Per-branch terms make it modular and easy to drop into other objectives.
+ Directly motivated regularised JEPA and Gaussian-embedding formulations.
− Three loss weights ($\lambda,\mu,\nu$) require balancing.
− Controls only first and second moments, not the full embedding distribution.

Connections & references

Builds on: SSL Dynamics

Paper ↗