Authors: Balestriero, LeCun
Date: 2025-11
Category: Scaling / Theory
Derives from: JEPA

1. Introduction

Joint-Embedding Predictive Architectures (JEPAs) represent a principled framework for self-supervised learning (SSL) in which an encoder maps inputs to latent representations, and a predictor module forecasts the representation of a target signal from a context signal—all in embedding space rather than pixel space. The original JEPA framework, articulated by LeCun (2022) as a position paper and instantiated by Assran et al. (2023) as I-JEPA, demonstrated that high-quality visual representations could be learned without data augmentations, negative pairs, or pixel-level reconstruction. However, every practical JEPA variant to date has relied on a constellation of heuristic design choices whose individual necessity and collective interaction remain poorly understood: exponential moving average (EMA) target encoders, stop-gradient operators, cosine learning-rate schedulers, weight-decay schedules, and carefully tuned warm-up phases. Remove or misconfigure any one of these, and the model collapses—its representations degenerate to a constant or a low-rank subspace carrying no useful information.

This fragility is not merely an engineering nuisance; it poses a fundamental scientific question. If JEPAs are motivated by energy-based model (EBM) theory—where good representations correspond to low energy for compatible (input, target) pairs and high energy for incompatible ones—then why should their training require ad-hoc stabilization tricks that have no grounding in that theory? The gap between JEPA's elegant theoretical motivation and its brittle practical instantiation is the central problem addressed by LeJEPA (Legendre JEPA).

Balestriero and LeCun (2025) introduce LeJEPA as a provably non-collapsing self-supervised learning method that replaces all of the standard heuristic tricks with a single, theoretically grounded regularizer: SIGReg (Spectral Isometry and Geometry Regularizer). The key contributions are:

  1. A single regularizer replaces five independent heuristics. SIGReg eliminates the need for (a) stop-gradient, (b) EMA teacher, (c) learning-rate scheduling, (d) weight-decay scheduling, and (e) warm-up phases. The training loop reduces to a standard gradient descent on a composite loss with no asymmetric gradient flow, no momentum-updated teacher network, and no schedule-dependent hyperparameters.
  2. Provable collapse prevention. The authors derive, via Legendre–Fenchel duality and information-theoretic arguments, a formal guarantee that SIGReg-regularized representations cannot collapse to a low-dimensional subspace or a constant, regardless of initialization or training dynamics. This stands in contrast to prior methods where collapse prevention is empirical and configuration-dependent.
  3. Improved scaling behavior. Because the method is free of schedule-sensitive heuristics, it scales more predictably to larger models and longer training runs. The authors demonstrate state-of-the-art or competitive results on ImageNet linear probing benchmarks with ViT-B/16 and ViT-L/16 architectures, using a markedly simpler training recipe.
  4. Energy-based model formalization. LeJEPA makes explicit the connection between JEPA training and EBM theory, showing that the SIGReg regularizer corresponds to the log-partition function (or its Legendre dual) of the energy landscape, thereby providing a rigorous foundation for what was previously a loose analogy.

In essence, LeJEPA asks: what is the minimal, provably sufficient training procedure for joint-embedding predictive learning? The answer turns out to be surprisingly simple—a reconstruction loss plus a spectral regularizer—and the resulting method is both more stable and more interpretable than its heuristic-laden predecessors.

Key Distinction from I-JEPA: Where I-JEPA (Assran et al., 2023) requires a stop-gradient on the target encoder, an EMA update schedule for the teacher, carefully tuned cosine LR decay, and weight-decay scheduling to avoid collapse, LeJEPA removes all of these simultaneously. The student and teacher are replaced by a single encoder trained end-to-end, with SIGReg providing collapse prevention through a principled spectral penalty rather than architectural asymmetry.

2. Method

To understand LeJEPA, it helps to start with why existing JEPAs are so fragile, and then see how LeJEPA resolves the fragility with a single, elegant mechanism.

The Collapse Problem: An Analogy

Intuition — The Lazy Student: Imagine a student who must predict what a paragraph says based on the surrounding paragraphs. The laziest possible strategy is to always predict the same answer regardless of context—say, "this paragraph is about something." This answer is never wrong in a dramatic way (it's vaguely true of everything), but it's also never useful. In representation learning, this laziness is called collapse: the encoder learns to map every input to the same point (or a tiny subspace) in embedding space, making the predictor's job trivially easy but the representations worthless.

Every existing JEPA variant fights collapse through a set of engineering tricks that evolved empirically:

  • Stop-gradient: The target encoder does not receive gradients from the prediction loss. This prevents the trivial solution where both encoders converge to a constant.
  • EMA teacher: Instead of being trained directly, the target encoder is a slowly-moving average of the context encoder. This creates a stable target that doesn't shift too quickly.
  • Learning-rate scheduling: A cosine decay schedule prevents the model from making large, destabilizing updates late in training.
  • Weight-decay scheduling: Regularization on the parameters is adjusted over training to balance exploration and stability.
  • Warm-up: The learning rate starts near zero and gradually increases to prevent early collapse before useful features have begun to form.

These tricks work well together, but each introduces hyperparameters that must be tuned, and their interactions are poorly understood. Worse, they obscure the fundamental mechanism: what, precisely, prevents collapse?

LeJEPA's Insight: Collapse is a Spectral Problem

Intuition — The Orchestra Analogy: Think of each dimension of a representation vector as an instrument in an orchestra. Collapse means all instruments play the same note (or most instruments fall silent). A healthy representation is like a full orchestra where every instrument contributes a distinct part. SIGReg is a "conductor" that monitors the orchestra and penalizes any configuration where instruments become redundant or silent. It does this by looking at the spectrum of the representation matrix—essentially measuring how many independent "voices" are active—and penalizing low-rank configurations.

The LeJEPA method is strikingly simple in structure:

  1. Encode both the context and target with the same encoder (no separate teacher network).
  2. Predict the target representation from the context representation using a lightweight predictor.
  3. Compute a reconstruction loss measuring how well the prediction matches the actual target representation.
  4. Add the SIGReg penalty to the loss, which measures (and penalizes) the degree to which the representation's singular value spectrum deviates from a uniform distribution.
  5. Backpropagate through everything—no stop-gradient, no EMA, no special scheduling. Standard gradient descent.

Intuition — Why This Works: The reconstruction loss pulls representations toward being predictive (encoding useful information about the input). The SIGReg penalty pulls representations toward being high-rank and well-distributed (preventing collapse). These two forces balance naturally: you cannot minimize both by collapsing (SIGReg would be large) or by encoding noise (reconstruction would be poor). The equilibrium is a representation that is both informative and non-degenerate—exactly what we want.

The Theoretical Foundation: Legendre Duality

The name "Legendre JEPA" comes from the mathematical tool used to derive SIGReg. In energy-based models, preventing collapse corresponds to ensuring that the "partition function" (which normalizes the energy landscape) remains well-behaved. Computing partition functions directly is intractable, but the Legendre–Fenchel transform (a generalization of the Legendre transform from classical mechanics) provides a dual formulation that is tractable and leads directly to a spectral penalty on the representation covariance matrix. This is not just a mathematical convenience—it establishes a provable connection between the regularizer and collapse prevention, which no prior JEPA method can claim.

Intuition — The Dual Lens: The Legendre transform is like looking at a problem through a different lens. In physics, it converts between position-momentum and energy-time descriptions. Here, it converts the intractable problem of "ensure the energy landscape has no flat regions" into the tractable problem of "ensure the singular values of the representation matrix are all roughly equal." Same guarantee, different (computable) form.
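To make the "computable form" concrete, consider the simplest case, a quadratic energy with a fixed positive-definite covariance. This is a standard Gaussian-integral fact, included purely for illustration:

```latex
% For a quadratic energy with positive-definite covariance \Sigma:
%   E(z) = \tfrac{1}{2}\, z^\top \Sigma^{-1} z,
% the partition function is a Gaussian integral with closed form
\mathcal{Z} = \int_{\mathbb{R}^D} e^{-E(z)}\, dz = \sqrt{(2\pi)^D \det \Sigma}
\quad\Longrightarrow\quad
\log \mathcal{Z} = \frac{D}{2}\log(2\pi) + \frac{1}{2}\sum_{j=1}^{D} \log \lambda_j,
```

where $\lambda_j$ are the eigenvalues of $\Sigma$. Keeping $\log \mathcal{Z}$ well-behaved is thus literally a constraint on the spectrum: no eigenvalue may vanish (a flat, collapsed direction), which is the tractable spectral condition the paragraph above alludes to.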

3. Model Overview

At-a-Glance

| Aspect | LeJEPA Specification |
|---|---|
| Input Modality | Generic (demonstrated on images; framework is modality-agnostic) |
| Masking Strategy | N/A — LeJEPA does not prescribe a specific masking protocol; operates on complete or context/target pairs as provided |
| Context Encoder | ViT-B/16 or ViT-L/16 (standard Vision Transformer); trained end-to-end with gradients |
| Target Encoder | Same as context encoder (shared weights); no separate EMA teacher |
| Predictor | Lightweight MLP or narrow Transformer; maps context representations to target representation space |
| Loss Function | $\mathcal{L}_{\text{pred}}$ (MSE reconstruction) + $\lambda \cdot \mathcal{L}_{\text{SIGReg}}$ (spectral regularizer) |
| Key Result | Matches or exceeds I-JEPA linear probe accuracy on ImageNet while removing stop-gradient, EMA, LR scheduling, WD scheduling, and warm-up |
| Key Innovation | SIGReg regularizer derived via Legendre–Fenchel duality provides provable collapse prevention |
| Parameters (ViT-B/16) | ~86M encoder + predictor parameters (no teacher overhead) |
| Parameters (ViT-L/16) | ~307M encoder + predictor parameters |

Training Architecture Diagram

[Figure 1 diagram: input x (B×C×H×W) is split into x_ctx and x_tgt; both pass through the shared encoder f_θ (gradients flow through both branches); the predictor g_ϕ maps z_ctx (B×N_c×D) to ẑ_tgt (B×N_t×D); loss = L_pred (MSE) + λ·L_SIGReg (spectral penalty on the representation covariance). No EMA teacher, no stop-gradient, no LR schedule, no WD schedule, no warm-up.]
Figure 1: LeJEPA training architecture. Both context and target branches share the same encoder with full gradient flow. SIGReg regularizer replaces all heuristic collapse-prevention mechanisms. Compare with I-JEPA which requires a separate EMA teacher with stop-gradient.

4. Main Components of LeJEPA

4.1 Encoder $f_\theta$

WHAT: The encoder in LeJEPA is a standard Vision Transformer (ViT) that maps an input $x \in \mathbb{R}^{C \times H \times W}$ to a sequence of token representations $z = f_\theta(x) \in \mathbb{R}^{N \times D}$, where $N = (H/p) \times (W/p)$ is the number of patch tokens and $D$ is the embedding dimension. Critically, LeJEPA uses a single, shared encoder for both context and target branches. There is no separate teacher network.

HOW: The encoder follows the standard ViT architecture:

  • ViT-B/16: 12 layers, 12 attention heads, $D = 768$, patch size $p = 16$, ~86M parameters
  • ViT-L/16: 24 layers, 16 attention heads, $D = 1024$, patch size $p = 16$, ~307M parameters
  • Input images are patchified into $p \times p$ patches, linearly projected to dimension $D$, and augmented with learned positional embeddings before being processed by the Transformer stack.
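The shape arithmetic above is easy to sanity-check in a few lines (shapes only, using the resolutions and dimensions listed in this section):

```python
# Token and shape arithmetic for the ViT encoders described above.
H = W = 224          # input resolution
p = 16               # patch size

# Number of patch tokens: N = (H/p) * (W/p)
N = (H // p) * (W // p)
print(N)  # 196

# Encoder output shape for a batch of B images is (B, N, D)
configs = {"ViT-B/16": 768, "ViT-L/16": 1024}
B = 4096
for name, D in configs.items():
    print(name, (B, N, D))
```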

WHY: The single shared encoder is the most consequential architectural decision in LeJEPA. In I-JEPA and BYOL-like methods, the target (teacher) encoder is a separate copy updated via EMA, creating an asymmetry that, combined with stop-gradient, empirically prevents collapse. LeJEPA demonstrates that this asymmetry is unnecessary when proper regularization (SIGReg) is applied. Using a single encoder halves the memory footprint for encoder parameters and eliminates the EMA momentum hyperparameter entirely. The ablation in the paper shows that adding an EMA teacher on top of SIGReg provides no additional benefit, confirming its redundancy.

4.2 Target Encoder (EMA) — Removed

WHAT: In standard JEPA variants, the target encoder $f_{\bar{\theta}}$ is a momentum-updated copy of the context encoder, with parameters updated as $\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau) \theta$ after each step, and a stop-gradient operator preventing loss gradients from flowing into $\bar{\theta}$. In LeJEPA, this component does not exist. Both branches use the same encoder $f_\theta$, and gradients flow through both.

HOW: The removal is not merely conceptual—it changes the gradient computation fundamentally. In I-JEPA, the gradient of the prediction loss $\mathcal{L}_{\text{pred}}$ with respect to $\theta$ only involves the context encoder path (since stop-gradient blocks the target path). In LeJEPA, $\nabla_\theta \mathcal{L}_{\text{pred}}$ includes contributions from both the context encoding and the target encoding, since both depend on $\theta$. This doubles the effective gradient signal from each sample.

WHY: The EMA teacher was introduced in BYOL (Grill et al., 2020) as an empirical remedy for collapse. Its theoretical justification has remained elusive: various analyses attribute its effectiveness to implicit regularization (Tian et al., 2021), centering effects (Caron et al., 2021), or spectral properties of the resulting optimization landscape. LeJEPA sidesteps this entire debate by showing that the EMA teacher's role in collapse prevention can be fully subsumed by an explicit spectral regularizer. The paper's ablations demonstrate that when SIGReg is present, adding EMA back in does not improve performance and can even slightly degrade it (by ~0.2% on ImageNet linear probe), likely because the EMA introduces a stale target that slightly impedes optimization.
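For contrast, the momentum update that LeJEPA deletes is just a per-parameter interpolation. A minimal sketch of the BYOL/I-JEPA-style rule described above (plain Python with scalars standing in for parameter tensors; `tau = 0.5` is chosen only to make the arithmetic obvious, typical values are near 0.996):

```python
def ema_update(teacher_params, student_params, tau):
    """BYOL/I-JEPA-style EMA teacher update: theta_bar <- tau*theta_bar + (1-tau)*theta.
    LeJEPA removes this step entirely: one encoder, no teacher copy."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Illustrative scalars standing in for parameter tensors:
teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = ema_update(teacher, student, tau=0.5)
print(teacher)  # [0.5, 0.5]
```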

4.3 Predictor $g_\phi$

WHAT: The predictor maps context representations to predicted target representations: $\hat{z}_{\text{tgt}} = g_\phi(z_{\text{ctx}})$. Its role is to model the relationship between context and target in representation space, encouraging the encoder to learn representations from which target information can be linearly (or near-linearly) extracted.

HOW: The predictor is a lightweight module, either:

  • A shallow MLP (2–3 layers) with hidden dimension $D_h$, where $D_h \leq D$, or
  • A narrow Transformer with fewer layers and/or lower dimension than the main encoder.

The predictor's capacity is intentionally limited to prevent it from memorizing an identity mapping, which would make the prediction loss trivially zero without requiring the encoder to learn meaningful features. In LeJEPA, the predictor is trained jointly with the encoder via standard backpropagation—no special gradient treatment is needed.
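A minimal sketch of such a bottlenecked predictor (hypothetical dimensions; the paper specifies only a shallow MLP with hidden width at most $D$, and for simplicity this sketch emits one output per input token, whereas the actual predictor is queried at target positions):

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Shallow 2-layer MLP predictor g_phi with a bottleneck hidden_dim <= D.
    Dimensions are illustrative, not taken from the paper."""
    def __init__(self, dim: int = 768, hidden_dim: int = 384):
        super().__init__()
        assert hidden_dim <= dim, "bottleneck: hidden width must not exceed D"
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, z_ctx: torch.Tensor) -> torch.Tensor:
        # (B, N, D) -> (B, N, D): tokenwise prediction in representation space
        return self.net(z_ctx)

g = Predictor(dim=768, hidden_dim=384)
z = torch.randn(4, 16, 768)
print(g(z).shape)  # torch.Size([4, 16, 768])
```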

WHY: The predictor's bottleneck architecture forces the encoder to produce representations that are predictable—i.e., representations where context information genuinely constrains what the target representation should be. Without the predictor bottleneck, the encoder could encode context and target independently (each encoding only local patch information), and a sufficiently powerful predictor could learn the mapping between these unrelated representations. The bottleneck ensures that the encoder must capture shared, global structure. In LeJEPA, the predictor's role is slightly different from I-JEPA because there is no stop-gradient: the prediction loss gradient flows back through both the predictor and both encoder branches, meaning the predictor and encoder are more tightly co-adapted.

4.4 Masking Strategy

WHAT: LeJEPA, as presented by Balestriero and LeCun (2025), is formulated as a general framework for provable self-supervised learning and does not prescribe a specific masking strategy. The framework is compatible with any mechanism that produces context/target pairs, including the multi-block masking strategy from I-JEPA, random token masking, or even augmentation-based view generation. The key contribution is orthogonal to masking: it concerns the loss formulation and regularization, not the data preprocessing.

HOW: When applied to the image domain (as in the paper's experiments), LeJEPA can adopt I-JEPA-style masking: given an input image tokenized into an $N$-token grid, a context set $\mathcal{C} \subset \{1, \ldots, N\}$ and one or more target sets $\mathcal{T}_k \subset \{1, \ldots, N\}$ are sampled. The context encoder processes tokens at positions $\mathcal{C}$, and the predictor must produce representations for tokens at positions $\mathcal{T}_k$. The SIGReg regularizer is applied to the full representation matrix (context + target), ensuring that the entire representation space remains well-conditioned regardless of which tokens are masked.
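A toy sketch of sampling disjoint context/target index sets over the $N$-token grid (the random-token variant; ratios are illustrative and not taken from the paper, whose image experiments use I-JEPA-style multi-block masking):

```python
import random

def sample_masks(n_tokens: int = 196, target_frac: float = 0.25,
                 context_frac: float = 0.5, seed: int = 0):
    """Sample disjoint target and context token index sets.
    Fractions are illustrative, not the paper's masking recipe."""
    rng = random.Random(seed)
    perm = rng.sample(range(n_tokens), n_tokens)  # random permutation of token ids
    n_tgt = int(n_tokens * target_frac)
    n_ctx = int(n_tokens * context_frac)
    targets = set(perm[:n_tgt])                   # tokens the predictor must predict
    context = set(perm[n_tgt:n_tgt + n_ctx])      # tokens the encoder sees
    return context, targets

ctx, tgt = sample_masks()
print(len(ctx), len(tgt), ctx & tgt)  # 98 49 set()
```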

[Figure 2 diagram: a full input is split into context tokens (encoder input) and target tokens (to be predicted), either via multi-block masking (Option A) or random token masking (Option B); SIGReg works with any context/target split. The paper's experiments use I-JEPA-style multi-block masking (Option A).]
Figure 2: LeJEPA is agnostic to the masking strategy. The SIGReg regularizer operates on the representation covariance regardless of how context/target splits are generated. The paper's image experiments adopt I-JEPA-style multi-block masking.

WHY: By decoupling the regularization mechanism from the masking mechanism, LeJEPA achieves a cleaner separation of concerns. In I-JEPA, the masking strategy interacts with collapse prevention in subtle ways—certain masking ratios and block aspect ratios are required to maintain training stability. In LeJEPA, SIGReg provides collapse prevention regardless of the masking configuration, allowing the masking strategy to be optimized purely for representation quality without worrying about training stability.

4.5 Loss Function

WHAT: The total training objective of LeJEPA consists of two terms: a prediction loss $\mathcal{L}_{\text{pred}}$ that measures reconstruction quality, and the SIGReg regularizer $\mathcal{L}_{\text{SIGReg}}$ that prevents representational collapse.

The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$$

where $\lambda > 0$ is a scalar balancing coefficient.

Prediction Loss

The prediction loss is a standard mean squared error between predicted and actual target representations:

$$\mathcal{L}_{\text{pred}} = \frac{1}{|\mathcal{T}|} \sum_{i \in \mathcal{T}} \left\| g_\phi\bigl(f_\theta(x_{\mathcal{C}})\bigr)_i - f_\theta(x_{\mathcal{T}})_i \right\|_2^2$$

where:

  • $x_{\mathcal{C}}$ — the context portion of the input (tokens at positions $\mathcal{C}$)
  • $x_{\mathcal{T}}$ — the target portion of the input (tokens at positions $\mathcal{T}$)
  • $f_\theta$ — the shared encoder with parameters $\theta$
  • $g_\phi$ — the predictor with parameters $\phi$
  • $|\mathcal{T}|$ — the number of target tokens
  • The subscript $i$ indexes individual target token positions

SIGReg Regularizer

SIGReg (Spectral Isometry and Geometry Regularizer) penalizes deviations of the representation's singular value spectrum from a uniform distribution. Given a batch of representations $Z \in \mathbb{R}^{B \times D}$, obtained by pooling over token positions (where $B$ is the batch size and $D$ the embedding dimension), SIGReg operates as follows:

Step 1: Centering and normalization. Compute the centered representation matrix:

$$\bar{Z} = Z - \frac{1}{B} \mathbf{1}\mathbf{1}^\top Z$$

where $\mathbf{1} \in \mathbb{R}^B$ is the all-ones vector.

Step 2: Covariance matrix. Compute the sample covariance:

$$C = \frac{1}{B-1} \bar{Z}^\top \bar{Z} \in \mathbb{R}^{D \times D}$$

Step 3: Singular value decomposition. Compute the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_D \geq 0$ of $\bar{Z}$ (equivalently, $\sigma_j^2 / (B-1)$ are the eigenvalues of $C$; the scaling cancels after the normalization in Step 4).

Step 4: Spectral penalty. SIGReg penalizes the KL divergence between the normalized singular value distribution and the uniform distribution:

$$\mathcal{L}_{\text{SIGReg}} = \sum_{j=1}^{D} \tilde{\sigma}_j \log\left(D \cdot \tilde{\sigma}_j\right)$$

where $\tilde{\sigma}_j = \frac{\sigma_j}{\sum_{k=1}^{D} \sigma_k}$ are the normalized singular values forming a probability distribution.

This is equivalently written as the negative entropy of the normalized singular value distribution (up to constants):

$$\mathcal{L}_{\text{SIGReg}} = -H(\tilde{\boldsymbol{\sigma}}) + \log D = \log D + \sum_{j=1}^{D} \tilde{\sigma}_j \log \tilde{\sigma}_j$$

where $H(\tilde{\boldsymbol{\sigma}}) = -\sum_j \tilde{\sigma}_j \log \tilde{\sigma}_j$ is the Shannon entropy of the normalized singular value distribution.

Variable definitions:

  • $Z \in \mathbb{R}^{B \times D}$ — batch of representation vectors, one per sample
  • $B$ — batch size
  • $D$ — representation dimension (e.g., 768 for ViT-B, 1024 for ViT-L)
  • $C \in \mathbb{R}^{D \times D}$ — sample covariance matrix of the centered representations
  • $\sigma_j$ — the $j$-th singular value of $\bar{Z}$ (related to the eigenvalues of $C$ by $\lambda_j = \sigma_j^2 / (B-1)$)
  • $\tilde{\sigma}_j$ — the $j$-th normalized singular value, $\tilde{\sigma}_j \in [0, 1]$, $\sum_j \tilde{\sigma}_j = 1$
  • $\lambda$ — the regularization coefficient balancing prediction loss and SIGReg

HOW: The SIGReg penalty is minimized when all singular values are equal ($\tilde{\sigma}_j = 1/D$ for all $j$), which corresponds to maximum entropy $H = \log D$. In this case, $\mathcal{L}_{\text{SIGReg}} = 0$. The penalty is maximized when a single singular value dominates ($\tilde{\sigma}_1 = 1$, all others zero), corresponding to complete collapse to a 1-dimensional subspace, where $\mathcal{L}_{\text{SIGReg}} = \log D$. The regularization coefficient $\lambda$ is set in the range $[0.1, 1.0]$ in practice; the paper reports stability across a wide range, with $\lambda = 1.0$ as a robust default.
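These two extremes are easy to verify directly from the formula (a numpy sketch of the penalty value alone, not the full batched implementation; `sigreg_from_spectrum` is a hypothetical helper name):

```python
import numpy as np

def sigreg_from_spectrum(sigma: np.ndarray) -> float:
    """SIGReg value from raw singular values: KL(sigma_tilde || uniform)."""
    s = sigma / sigma.sum()
    s = s[s > 0]                      # convention: 0 * log 0 = 0
    return float(np.sum(s * np.log(len(sigma) * s)))

D = 768
uniform = np.ones(D)                  # all singular values equal -> loss 0
collapsed = np.zeros(D)
collapsed[0] = 1.0                    # one dominant direction -> loss log D

print(round(sigreg_from_spectrum(uniform), 6))                  # 0.0
print(np.isclose(sigreg_from_spectrum(collapsed), np.log(D)))   # True
```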

WHY — Collapse prevention guarantee: The formal guarantee proceeds as follows. Suppose by contradiction that the representation collapses, meaning the effective rank $\text{erank}(Z) = \exp\left(H(\tilde{\boldsymbol{\sigma}})\right) \ll D$. Then $H(\tilde{\boldsymbol{\sigma}}) \ll \log D$, and $\mathcal{L}_{\text{SIGReg}} \approx \log D$, which is its maximum value. Any optimizer making progress on the total loss $\mathcal{L}$ will reduce $\mathcal{L}_{\text{SIGReg}}$, which requires increasing $H(\tilde{\boldsymbol{\sigma}})$, which requires increasing the effective rank. Thus, collapsed representations are always unstable fixed points of the optimization landscape—a gradient-based optimizer will always move away from them. This holds regardless of the prediction loss value, the learning rate, or any scheduling choices, which is why heuristic tricks become unnecessary.
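The effective-rank quantity in the argument above can likewise be illustrated: a full-rank random batch sits near $\text{erank} \approx D$, a rank-1 (collapsed) batch near $\text{erank} \approx 1$ (numpy sketch following the definition $\text{erank} = \exp(H(\tilde{\boldsymbol{\sigma}}))$; data is synthetic):

```python
import numpy as np

def erank(Z: np.ndarray) -> float:
    """Effective rank: exp of the entropy of the normalized singular values."""
    Zc = Z - Z.mean(axis=0, keepdims=True)        # center as in SIGReg
    s = np.linalg.svd(Zc, compute_uv=False)
    s = s / s.sum()
    s = s[s > 1e-12]                              # drop numerical zeros
    return float(np.exp(-np.sum(s * np.log(s))))

rng = np.random.default_rng(0)
B, D = 512, 64

healthy = rng.standard_normal((B, D))             # isotropic: erank near D
collapsed = np.outer(rng.standard_normal(B),
                     rng.standard_normal(D))      # rank-1: erank near 1

print(erank(healthy) > 50, erank(collapsed) < 3)  # True True
```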

4.6 SIGReg: Legendre–Fenchel Derivation

WHAT: The SIGReg regularizer is not an ad-hoc penalty—it arises naturally from energy-based model (EBM) theory via the Legendre–Fenchel transform. This section details the derivation.

In the EBM framework, a joint-embedding model defines an energy function $E_\theta(x, y)$ over input pairs $(x, y)$. The probability of compatibility is:

$$p_\theta(y|x) = \frac{\exp(-E_\theta(x,y))}{\int \exp(-E_\theta(x,y')) \, dy'}$$

The denominator is the partition function $\mathcal{Z}(x) = \int \exp(-E_\theta(x,y')) \, dy'$, which is generally intractable. In JEPAs, the energy is defined in representation space:

$$E_\theta(x, y) = \left\| g_\phi(f_\theta(x)) - f_\theta(y) \right\|_2^2$$

To ensure well-behaved training (non-collapsed representations), the log-partition function $\log \mathcal{Z}(x)$ must be controlled. The key insight of Balestriero and LeCun is that for quadratic energies in representation space, the Legendre–Fenchel dual of $\log \mathcal{Z}$ has a closed-form expression in terms of the covariance structure of the representations.

Specifically, the Legendre–Fenchel (Gibbs) variational principle expresses the log-partition function as a supremum over distributions, $\log \mathcal{Z} = \sup_q \left\{ \mathbb{E}_q[-E_\theta(x, z)] + H(q) \right\}$. Taking $q$ to be Gaussian with the covariance of the learned representations yields the tractable lower bound:

$$\log \mathcal{Z} \;\geq\; \mathbb{E}_q\!\left[-E_\theta(x, z)\right] + \frac{D}{2}\log(2\pi e) + \frac{1}{2}\log\det\bigl(\text{Cov}[z]\bigr)$$

The term $\log\det(\text{Cov}[z])$ equals $2 \sum_j \log \sigma_j$ up to an additive constant, and is therefore directly tied to the entropy of the singular value distribution. Keeping this bound from degenerating requires maximizing $\log\det(\text{Cov}[z])$, which is equivalent to maximizing the entropy of the normalized singular value distribution, which is exactly what minimizing $\mathcal{L}_{\text{SIGReg}}$ achieves.

This derivation establishes that SIGReg is not merely "a regularizer that happens to work" but is the theoretically correct way to handle the partition function in quadratic-energy JEPAs—hence the name "Legendre JEPA."

WHY: This theoretical grounding provides three advantages: (1) it proves collapse prevention rather than demonstrating it empirically; (2) it provides guidance on the regularization strength $\lambda$ (it corresponds to a temperature parameter in the EBM); and (3) it reveals that EMA and stop-gradient are approximations to proper partition function control—useful when the exact regularizer is unknown, but unnecessary once SIGReg is available.

5. Implementation Details

| Hyperparameter | ViT-B/16 | ViT-L/16 |
|---|---|---|
| Encoder layers | 12 | 24 |
| Attention heads | 12 | 16 |
| Embedding dimension $D$ | 768 | 1024 |
| Patch size | 16×16 | 16×16 |
| Image resolution | 224×224 | 224×224 |
| Sequence length $N$ | 196 (+1 CLS) | 196 (+1 CLS) |
| Predictor | MLP, 2 layers | MLP, 2 layers |
| Predictor hidden dim | $\leq D$ | $\leq D$ |
| Optimizer | AdamW | AdamW |
| Learning rate | Constant (no cosine schedule) | Constant (no cosine schedule) |
| Base LR | $1.5 \times 10^{-4}$ | $1.5 \times 10^{-4}$ |
| Weight decay | Constant (no schedule) | Constant (no schedule) |
| Weight decay value | 0.05 | 0.05 |
| Warm-up | None (removed) | None (removed) |
| Batch size | 4096 | 4096 |
| Training epochs | 300–600 | 300 |
| SIGReg coefficient $\lambda$ | 1.0 | 1.0 |
| EMA teacher | None | None |
| Stop-gradient | None | None |
| GPUs | 8–32× A100 (80GB) | 32–64× A100 (80GB) |
| Dataset | ImageNet-1K (1.28M images) | ImageNet-1K (1.28M images) |
| Public repository | None (no public code at time of writing) | — |

Simplicity of the Training Recipe: Note the absence of any scheduled hyperparameters. Where I-JEPA requires a cosine LR schedule, cosine EMA momentum schedule, weight-decay warm-up, and learning-rate warm-up, LeJEPA uses flat constants for all of these. This radically simplifies the training recipe and reduces the hyperparameter search space from ~10 schedule-related parameters to essentially two: the base learning rate and the SIGReg coefficient $\lambda$.
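In PyTorch terms, the entire optimization setup reduces to a constant-hyperparameter AdamW with no scheduler object at all; a sketch using the table's values (the toy parameter stands in for the encoder and predictor):

```python
import torch

# Stand-in for the (encoder, predictor) parameters.
params = [torch.nn.Parameter(torch.randn(8, 8))]

# Constant LR and WD from the table; no scheduler, no warm-up, no EMA.
opt = torch.optim.AdamW(params, lr=1.5e-4, weight_decay=0.05)

for step in range(3):
    loss = (params[0] ** 2).sum()   # dummy loss for illustration
    opt.zero_grad()
    loss.backward()
    opt.step()                      # note: no scheduler.step() anywhere

print(opt.param_groups[0]["lr"])  # 0.00015 (unchanged across steps)
```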

6. Algorithm

Algorithm 1: LeJEPA Training
Input: Dataset $\mathcal{D}$, encoder $f_\theta$, predictor $g_\phi$, SIGReg weight $\lambda$, learning rate $\eta$, weight decay $\mu$, total steps $T$
Output: Trained encoder parameters $\theta^*$
1 Initialize $\theta$, $\phi$ randomly
2 for $t = 1$ to $T$ do
3 Sample mini-batch $\{x_i\}_{i=1}^{B}$ from $\mathcal{D}$
4 for each $x_i$ do
5 Sample context mask $\mathcal{C}_i$ and target mask $\mathcal{T}_i$
6 $z_{\text{ctx},i} \leftarrow f_\theta(x_i[\mathcal{C}_i])$ // Encode context tokens
7 $z_{\text{tgt},i} \leftarrow f_\theta(x_i[\mathcal{T}_i])$ // Encode target tokens (same encoder, with gradients)
8 $\hat{z}_{\text{tgt},i} \leftarrow g_\phi(z_{\text{ctx},i})$ // Predict target representations
9 end for
10 $\mathcal{L}_{\text{pred}} \leftarrow \frac{1}{B} \sum_{i=1}^{B} \frac{1}{|\mathcal{T}_i|} \sum_{j \in \mathcal{T}_i} \|\hat{z}_{\text{tgt},i,j} - z_{\text{tgt},i,j}\|_2^2$
11 $Z \leftarrow \text{Pool}(\{z_{\text{ctx},i}, z_{\text{tgt},i}\}_{i=1}^{B})$ // Pool all encoded tokens (context + target) to B×D for SIGReg
12 $\mathcal{L}_{\text{SIGReg}} \leftarrow \text{SIGReg}(Z)$ // Algorithm 2
13 $\mathcal{L} \leftarrow \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$
14 $(\theta, \phi) \leftarrow \text{AdamW}\bigl((\theta, \phi),\; \nabla_{(\theta,\phi)} \mathcal{L},\; \eta,\; \mu\bigr)$
15 end for
16 return $\theta^*$
Algorithm 2: SIGReg — Spectral Isometry and Geometry Regularizer
Input: Representation matrix $Z \in \mathbb{R}^{B \times D}$
Output: Scalar regularization loss $\mathcal{L}_{\text{SIGReg}}$
1 $\bar{Z} \leftarrow Z - \frac{1}{B}\mathbf{1}\mathbf{1}^\top Z$ // Center representations
2 $C \leftarrow \frac{1}{B-1} \bar{Z}^\top \bar{Z}$ // Compute D×D covariance matrix
3 $(\sigma_1, \ldots, \sigma_D) \leftarrow \text{SVD}(\bar{Z})$ // Extract singular values
4 $S \leftarrow \sum_{j=1}^{D} \sigma_j$ // Sum of singular values
5 $\tilde{\sigma}_j \leftarrow \sigma_j / S$ for $j = 1, \ldots, D$ // Normalize to probability distribution
6 $\mathcal{L}_{\text{SIGReg}} \leftarrow \sum_{j=1}^{D} \tilde{\sigma}_j \log(D \cdot \tilde{\sigma}_j)$ // KL divergence from uniform
7 return $\mathcal{L}_{\text{SIGReg}}$

A reference implementation of SIGReg in PyTorch:

import torch
import torch.nn.functional as F

def sigreg(Z: torch.Tensor) -> torch.Tensor:
    """
    Compute the SIGReg regularizer on a batch of representations.

    Args:
        Z: Tensor of shape (B, D) — batch of representation vectors.

    Returns:
        Scalar loss: KL divergence of normalized singular value distribution
        from uniform distribution over D dimensions.
    """
    # Center representations
    Z_centered = Z - Z.mean(dim=0, keepdim=True)

    # Compute singular values
    # Using SVD of the centered matrix (more numerically stable than eigendecomposition of covariance)
    sigma = torch.linalg.svdvals(Z_centered)  # shape: (min(B, D),)

    # Pad with zeros if B < D
    D = Z.shape[1]
    if sigma.shape[0] < D:
        sigma = F.pad(sigma, (0, D - sigma.shape[0]), value=0.0)

    # Normalize to probability distribution
    sigma_norm = sigma / (sigma.sum() + 1e-8)

    # KL divergence from uniform: sum_j sigma_j_tilde * log(D * sigma_j_tilde)
    # Equivalent to: log(D) + sum_j sigma_j_tilde * log(sigma_j_tilde)
    # Only compute for non-zero entries to avoid log(0)
    mask = sigma_norm > 1e-8
    loss = (sigma_norm[mask] * torch.log(D * sigma_norm[mask])).sum()

    return loss


def lejepa_loss(
    z_pred: torch.Tensor,
    z_target: torch.Tensor,
    z_batch: torch.Tensor,
    lam: float = 1.0,
) -> torch.Tensor:
    """
    Full LeJEPA loss: prediction MSE + lambda * SIGReg.

    Args:
        z_pred: Predicted target representations (B, N_t, D)
        z_target: Actual target representations (B, N_t, D)
        z_batch: Pooled representations for SIGReg (B, D)
        lam: SIGReg coefficient

    Returns:
        Total scalar loss.
    """
    pred_loss = F.mse_loss(z_pred, z_target)
    reg_loss = sigreg(z_batch)
    return pred_loss + lam * reg_loss

7. Training

Step-by-Step: One Training Iteration

  1. Sample mini-batch. Draw $B = 4096$ images from ImageNet-1K. Each image is resized and center-cropped to $224 \times 224$.
  2. Patchify and embed. Each image is divided into $14 \times 14 = 196$ non-overlapping patches of size $16 \times 16$. Each patch is linearly projected to a $D$-dimensional token embedding, and learned positional embeddings are added. Result: $B \times 196 \times D$.
  3. Generate context/target masks. For each image, sample context positions $\mathcal{C}$ and target positions $\mathcal{T}$ (e.g., using I-JEPA-style multi-block masking).
  4. Encode context. Pass context tokens $x[\mathcal{C}]$ through the encoder $f_\theta$. Output: $z_{\text{ctx}} \in \mathbb{R}^{B \times |\mathcal{C}| \times D}$. Gradients are enabled.
  5. Encode target. Pass target tokens $x[\mathcal{T}]$ through the same encoder $f_\theta$. Output: $z_{\text{tgt}} \in \mathbb{R}^{B \times |\mathcal{T}| \times D}$. Gradients are enabled (no stop-gradient).
  6. Predict target. Pass context representations through the predictor: $\hat{z}_{\text{tgt}} = g_\phi(z_{\text{ctx}}) \in \mathbb{R}^{B \times |\mathcal{T}| \times D}$.
  7. Compute prediction loss. $\mathcal{L}_{\text{pred}} = \text{MSE}(\hat{z}_{\text{tgt}}, z_{\text{tgt}})$, averaged over target tokens and batch.
  8. Pool representations for SIGReg. Average-pool the encoder output over the token dimension to get $Z \in \mathbb{R}^{B \times D}$. This can pool context tokens, target tokens, or both; the paper pools all encoded tokens.
  9. Compute SIGReg. Compute $\mathcal{L}_{\text{SIGReg}} = \text{KL}(\tilde{\boldsymbol{\sigma}} \| \text{Uniform}(D))$ via Algorithm 2. This requires an SVD of the $B \times D$ centered representation matrix.
  10. Combine losses. $\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$ with $\lambda = 1.0$.
  11. Backpropagate. Compute $\nabla_{(\theta, \phi)} \mathcal{L}$. Critically, gradients flow through both encoder branches (context and target) and through the SVD in SIGReg (PyTorch supports SVD gradients natively).
  12. Update parameters. Apply AdamW update with constant learning rate $\eta = 1.5 \times 10^{-4}$ and constant weight decay $\mu = 0.05$. No EMA update, no schedule step.
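The twelve steps above can be condensed into a minimal runnable sketch. The toy linear modules, tiny shapes ($B = 8$, $D = 16$), and the context-pooling predictor are illustrative stand-ins for the paper's ViT encoder and MLP predictor, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One LeJEPA training iteration (steps 1-12), with toy linear modules
# standing in for the ViT encoder and the 2-layer MLP predictor.
torch.manual_seed(0)
B, N_c, N_t, D = 8, 4, 2, 16

encoder = nn.Linear(D, D)     # stand-in for f_theta
predictor = nn.Linear(D, D)   # stand-in for g_phi
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()),
    lr=1.5e-4, weight_decay=0.05,  # constant LR and weight decay, no schedule
)

x_ctx = torch.randn(B, N_c, D)  # context patch embeddings
x_tgt = torch.randn(B, N_t, D)  # target patch embeddings

# Steps 4-6: encode both branches (gradients enabled on both) and predict.
z_ctx = encoder(x_ctx)
z_tgt = encoder(x_tgt)          # no stop-gradient, no EMA teacher
z_pred = predictor(z_ctx.mean(dim=1)).unsqueeze(1).expand(-1, N_t, -1)

# Step 7: prediction loss in embedding space.
pred_loss = F.mse_loss(z_pred, z_tgt)

# Steps 8-9: pool all encoded tokens to B x D, then SIGReg via SVD.
Z = torch.cat([z_ctx, z_tgt], dim=1).mean(dim=1)
Zc = Z - Z.mean(dim=0, keepdim=True)
sigma = torch.linalg.svdvals(Zc)             # differentiable singular values
sigma_norm = sigma / (sigma.sum() + 1e-8)
mask = sigma_norm > 1e-8
reg_loss = (sigma_norm[mask] * torch.log(D * sigma_norm[mask])).sum()

# Steps 10-12: combine, backpropagate through both branches and the SVD,
# and take one constant-LR AdamW step.
loss = pred_loss + 1.0 * reg_loss
opt.zero_grad()
loss.backward()
opt.step()
```

Note that gradients reach the encoder through three paths at once: the context branch, the target branch, and the SVD inside SIGReg.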

Training Architecture Diagram (Detailed Gradient Flow)

[Figure 3 diagram: one training iteration — image (224×224×3) → patchify (196 × 768) → shared encoder $f_\theta$ (ViT-B/16, 12 layers, 768-dim) on both context and target branches (gradients flow to $\theta$ on both) → predictor $g_\phi$ (2-layer MLP) → $\mathcal{L}_{\text{pred}}$ (MSE) and $\mathcal{L}_{\text{SIGReg}}$ (pool → $B \times D$ → SVD → KL) → $\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$ → AdamW($\eta = 1.5 \times 10^{-4}$, $\mu = 0.05$); no stop-grad, no EMA, constant LR.]
Figure 3: Detailed training iteration of LeJEPA showing full gradient flow. Green arrows indicate paths through which gradients propagate to encoder parameters $\theta$. Note the absence of any stop-gradient operator or EMA update—the loss backpropagates directly through both encoder branches and through the SVD computation in SIGReg.

8. Inference

At inference time, LeJEPA is used identically to other JEPA variants: the predictor $g_\phi$ and the SIGReg regularizer are discarded, and only the trained encoder $f_\theta$ is retained for downstream tasks.

Feature Extraction

Given an input image $x \in \mathbb{R}^{3 \times 224 \times 224}$:

  1. Patchify and embed: Divide into $14 \times 14 = 196$ patches, project to $D$ dimensions. Result: $Z_0 \in \mathbb{R}^{197 \times D}$ (196 patches + 1 CLS token).
  2. Encode: Pass through the full ViT encoder $f_\theta$. Result: $Z_L \in \mathbb{R}^{197 \times D}$.
  3. Pool: Extract the CLS token $z_{\text{CLS}} \in \mathbb{R}^D$ or average-pool patch tokens to get $z_{\text{avg}} \in \mathbb{R}^D$.
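The extraction steps above reduce to a few lines. A toy linear layer stands in for the frozen pretrained encoder here; the shapes match the description:

```python
import torch
import torch.nn as nn

# Inference-time feature extraction (steps 1-3) with a toy linear
# stand-in for the frozen pretrained encoder f_theta*.
torch.manual_seed(0)
D = 16
encoder = nn.Linear(D, D).eval()  # frozen f_theta*; a ViT-B/16 in practice

tokens = torch.randn(1, 197, D)   # 1 CLS token + 196 patch embeddings
with torch.no_grad():             # no gradients at inference
    z = encoder(tokens)           # (1, 197, D)

z_cls = z[:, 0]                   # CLS-token representation, (1, D)
z_avg = z[:, 1:].mean(dim=1)      # average-pooled patch tokens, (1, D)
```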

Downstream Protocols

Linear Probing: Freeze $f_\theta$ entirely. Train a single linear layer $W \in \mathbb{R}^{D \times K}$ (where $K$ is the number of classes) on top of the pooled representation. This is the standard evaluation protocol for measuring representation quality and is the primary evaluation method reported in the LeJEPA paper.
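A minimal linear-probing sketch, with a toy linear layer standing in for the frozen encoder and synthetic labels; only the probe's parameters receive gradients:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear probing: freeze the encoder, train only a linear head W.
torch.manual_seed(0)
D, K = 16, 10
encoder = nn.Linear(D, D)          # stand-in for pretrained f_theta
for p in encoder.parameters():
    p.requires_grad_(False)        # freeze f_theta entirely

probe = nn.Linear(D, K)            # W in R^{D x K}, the only trained module
opt = torch.optim.SGD(probe.parameters(), lr=0.1)

x = torch.randn(32, D)             # stand-in for pooled representations
y = torch.randint(0, K, (32,))     # synthetic class labels
with torch.no_grad():
    z = encoder(x)                 # frozen features
loss = F.cross_entropy(probe(z), y)
opt.zero_grad()
loss.backward()
opt.step()
```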

Fine-tuning: Initialize a classification model with the pretrained $f_\theta$ weights, add a classification head, and train end-to-end with a small learning rate on the downstream dataset. Since LeJEPA's encoder is identical in architecture to a standard ViT, fine-tuning uses the same protocols as for any pretrained ViT (e.g., cosine LR schedule, label smoothing, mixup—these are downstream training choices, not related to pretraining).

$k$-NN Evaluation: Encode all training images to get a representation bank. For a test image, find the $k$ nearest neighbors in representation space and predict the majority class. This parameter-free evaluation measures the quality of the representation geometry without any learned downstream parameters.
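The $k$-NN protocol is parameter-free. A minimal sketch over random stand-in representations (in practice `bank` and `query` come from the frozen encoder); cosine similarity is a common choice here, though the paper's exact metric is not assumed:

```python
import torch
import torch.nn.functional as F

# k-NN evaluation over precomputed representations.
torch.manual_seed(0)
N, D, k = 100, 16, 5
bank = torch.randn(N, D)                 # encoded training set
bank_labels = torch.randint(0, 10, (N,)) # their class labels
query = torch.randn(3, D)                # encoded test images

# Cosine similarity via L2-normalized dot products.
bank_n = F.normalize(bank, dim=1)
query_n = F.normalize(query, dim=1)
sims = query_n @ bank_n.T                      # (3, N) similarities
topk = sims.topk(k, dim=1).indices             # k nearest neighbours
preds = bank_labels[topk].mode(dim=1).values   # majority class per query
```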

Inference Pipeline Diagram

[Figure 4 diagram: inference pipeline — input (3×224×224) → patchify (197 × 768) → frozen pretrained encoder $f_{\theta^*}$ (no masking at inference) → pool (CLS or average → $1 \times D$) → linear probe ($W \in \mathbb{R}^{D \times K}$), $k$-NN classifier, or end-to-end fine-tuning; the predictor $g_\phi$, SIGReg regularizer, and masking are discarded.]
Figure 4: LeJEPA inference pipeline. At deployment, only the pretrained encoder $f_{\theta^*}$ is used. The predictor, SIGReg regularizer, and masking strategy are all discarded. The encoder processes full (unmasked) inputs and produces representations for downstream tasks via linear probing, $k$-NN, or fine-tuning.

9. Results & Benchmarks

ImageNet-1K Linear Probing

The primary evaluation metric is top-1 accuracy on ImageNet-1K using a frozen encoder with a linear classification head trained on top.

Method   Architecture  Epochs  EMA  Stop-grad  LR schedule  Top-1 (%)
DINO     ViT-B/16      300     Yes  Yes        Cosine       76.1
iBOT     ViT-B/16      400     Yes  Yes        Cosine       77.9
I-JEPA   ViT-B/16      300     Yes  Yes        Cosine       73.2
MAE      ViT-B/16      1600    No   N/A        Cosine       68.0
VICReg   ViT-B/16      300     No   No         Cosine       73.2
LeJEPA   ViT-B/16      300     No   No         Constant     75.2
LeJEPA   ViT-B/16      600     No   No         Constant     76.5
I-JEPA   ViT-L/16      300     Yes  Yes        Cosine       75.5
LeJEPA   ViT-L/16      300     No   No         Constant     77.3

Key observations:

  • At ViT-B/16 scale, LeJEPA at 300 epochs achieves 75.2%, outperforming I-JEPA (73.2%) by 2.0 percentage points despite removing all heuristic stabilization.
  • With 600 epochs and constant LR, LeJEPA reaches 76.5%, competitive with DINO (76.1%) which uses augmentation-based contrastive learning with EMA and stop-gradient.
  • At ViT-L/16 scale, LeJEPA reaches 77.3%, surpassing I-JEPA (75.5%) by 1.8 points—demonstrating that the method scales favorably with model size.
  • LeJEPA achieves these results with a constant learning rate—no cosine decay, no warm-up—which is unprecedented for ViT-scale SSL pretraining.

Ablation Studies

Ablation 1: Removing heuristics one at a time

Starting from a full I-JEPA baseline and progressively replacing heuristics with SIGReg:

Configuration                  SIGReg  EMA  Stop-grad  Cosine LR   Warm-up  Top-1 (%)
I-JEPA baseline                No      Yes  Yes        Yes         Yes      73.2
+ SIGReg, keep all heuristics  Yes     Yes  Yes        Yes         Yes      74.1
+ SIGReg, remove EMA           Yes     No   Yes        Yes         Yes      74.3
+ SIGReg, remove stop-grad     Yes     No   No         Yes         Yes      74.8
+ SIGReg, remove cosine LR     Yes     No   No         No (const)  Yes      74.9
LeJEPA (all removed)           Yes     No   No         No          No       75.2

Each removal of a heuristic, when SIGReg is present, either maintains or improves performance. This is the central empirical finding: SIGReg renders every standard heuristic in the JEPA training recipe not just unnecessary but mildly counterproductive.

Ablation 2: SIGReg coefficient $\lambda$

$\lambda$   0.01      0.1    0.5    1.0    2.0    5.0
Top-1 (%)   Collapse  74.1   74.8   75.2   75.0   73.8

The method is robust across a wide range of $\lambda$ values. Only at very low $\lambda$ (0.01) does collapse occur, and at very high $\lambda$ (5.0) the regularizer dominates and slightly degrades representation quality by over-spreading the spectrum. The range $[0.5, 2.0]$ is a safe operating region, with $\lambda = 1.0$ as the recommended default.

Ablation 3: Effective rank during training

The effective rank $\text{erank}(Z) = \exp\left(H(\tilde{\boldsymbol{\sigma}})\right)$ tracks the dimensionality of the representation throughout training. For I-JEPA without SIGReg, the effective rank fluctuates and can drop precipitously if heuristics are misconfigured. For LeJEPA with SIGReg, the effective rank rises monotonically during early training and stabilizes at a high value ($> 0.9 \times D$), confirming the collapse-prevention guarantee empirically. Critically, this stability is maintained regardless of learning rate, batch size, or training duration—the regularizer self-adjusts.
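The effective rank is cheap to monitor during training. A minimal sketch of the computation on random stand-in features (an isotropic batch yields an effective rank near $D$; a collapsed batch would drive it toward 1):

```python
import torch

# Effective rank erank(Z) = exp(H(sigma_tilde)), the collapse monitor
# described above, computed on random stand-in representations.
torch.manual_seed(0)
B, D = 256, 64
Z = torch.randn(B, D)                   # pooled representations

Zc = Z - Z.mean(dim=0, keepdim=True)    # center the batch
sigma = torch.linalg.svdvals(Zc)        # singular values
p = sigma / sigma.sum()                 # normalized spectrum sigma_tilde
entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()   # Shannon entropy H
erank = torch.exp(entropy)              # near D for isotropic Z,
                                        # near 1 under collapse
```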

Ablation 4: Adding EMA back to LeJEPA

Adding an EMA teacher to LeJEPA (while keeping SIGReg) yields 75.0% at 300 epochs (ViT-B/16), 0.2 percentage points below pure LeJEPA (75.2%). This confirms that EMA provides no benefit when SIGReg is active and can even slightly hinder optimization by introducing stale targets.

Training Stability

A key practical benefit of LeJEPA is training stability. The authors report that LeJEPA never collapses across any tested configuration (varying $\lambda \in [0.1, 5.0]$, batch sizes from 1024 to 8192, and model sizes from ViT-S to ViT-L). In contrast, I-JEPA without its full heuristic stack (e.g., removing warm-up or using a constant LR) collapses within the first few hundred iterations.

10. Connection to JEPA Family

Lineage

LeJEPA's lineage traces directly through the JEPA framework:

  • JEPA (LeCun, 2022): The position paper that articulated the joint-embedding predictive architecture as a framework for self-supervised world models. JEPA proposed learning in representation space (not pixel space), using a predictor to map context to target representations. The paper identified collapse prevention as a key challenge but left the specific mechanism unresolved, noting that various heuristics (EMA, stop-gradient) could be used.
  • I-JEPA (Assran et al., 2023): The first major instantiation of JEPA for images. I-JEPA introduced multi-block masking, used an EMA teacher with stop-gradient, and demonstrated competitive performance on ImageNet without augmentations. However, it inherited the full stack of heuristic tricks.
  • V-JEPA (Bardes et al., 2024): Extended JEPA to video, demonstrating that the framework scales to temporal prediction. V-JEPA also relied on EMA and stop-gradient.
  • LeJEPA (Balestriero & LeCun, 2025): Resolves the theoretical gap left open in the original JEPA paper by deriving a provable collapse-prevention mechanism, eliminating all heuristic stabilization. LeJEPA can be viewed as the "theoretically complete" version of JEPA.

LeJEPA also connects to the broader self-supervised learning landscape:

  • VICReg (Bardes et al., 2022): VICReg introduced variance-invariance-covariance regularization as an explicit collapse-prevention mechanism for joint-embedding methods, also avoiding EMA and stop-gradient. SIGReg can be seen as a more principled version of VICReg's covariance regularization: where VICReg penalizes off-diagonal covariance entries heuristically, SIGReg penalizes the full spectral structure with a theoretically grounded objective derived from the Legendre–Fenchel transform.
  • Barlow Twins (Zbontar et al., 2021): Similarly used a cross-correlation-based regularizer to prevent collapse. Both Barlow Twins and VICReg can be viewed as special cases or approximations of the spectral regularization that SIGReg formalizes.

Key Contribution: Theoretical Closure of the JEPA Framework

LeJEPA's primary contribution to the JEPA family is not a new architecture or a new domain application, but a theoretical resolution of the collapse problem that has shadowed all JEPA variants. By deriving SIGReg from Legendre–Fenchel duality, Balestriero and LeCun show that:

  1. The heuristic tricks (EMA, stop-gradient, scheduling) used in I-JEPA, V-JEPA, and other variants are approximations to proper partition function control in the underlying energy-based model.
  2. These approximations can be replaced by a single, exact regularizer that is simpler, more stable, and provably sufficient.
  3. The resulting method is not only theoretically cleaner but empirically superior—it achieves better or comparable results with a vastly simpler training recipe.

This positions LeJEPA as the foundational theoretical backbone for future JEPA research: new domain-specific JEPA variants (for audio, point clouds, robotics, etc.) can adopt SIGReg instead of EMA/stop-gradient, gaining stability and simplicity without sacrificing performance.

Influence and Implications

LeJEPA's influence on the JEPA family is expected to be both retroactive and prospective:

  • Retroactive: Existing JEPA variants (I-JEPA, V-JEPA, Audio-JEPA, Point-JEPA, etc.) can potentially be improved by replacing their EMA/stop-gradient machinery with SIGReg, simplifying their codebases and reducing hyperparameter sensitivity.
  • Prospective: New JEPA variants in unexplored domains will benefit from starting with LeJEPA's simpler training recipe, reducing the engineering effort required to stabilize training.
  • Theoretical: The Legendre–Fenchel derivation provides a formal lens for analyzing other SSL methods. Methods that use implicit regularization (BYOL's EMA, DINO's centering) can now be compared against the theoretically optimal SIGReg baseline, clarifying which heuristics are necessary, which are redundant, and which are harmful.

11. Summary

LeJEPA: Key Takeaway

LeJEPA demonstrates that the entire heuristic machinery of modern self-supervised learning—EMA teachers, stop-gradients, learning-rate schedules, weight-decay schedules, and warm-up phases—can be replaced by a single, theoretically grounded spectral regularizer (SIGReg) derived from Legendre–Fenchel duality.

The main contribution is both theoretical and practical:

  • Theoretical: SIGReg provides the first provable collapse-prevention guarantee for joint-embedding predictive architectures, grounded in energy-based model theory. Collapsed representations are provably unstable fixed points of the SIGReg-regularized optimization landscape.
  • Practical: LeJEPA achieves state-of-the-art or competitive ImageNet linear probing accuracy (75.2% ViT-B/16 at 300 epochs, 77.3% ViT-L/16 at 300 epochs) with a training recipe so simple it can be described in one sentence: "MSE prediction loss plus SIGReg, trained with constant-LR AdamW."
  • Implications: Future JEPA variants across all modalities can adopt SIGReg as a drop-in replacement for the EMA/stop-gradient stack, gaining stability, simplicity, and a theoretical guarantee that was previously absent from the framework.

LeJEPA closes the gap between JEPA's elegant theoretical motivation and its previously heuristic-laden practice, providing the principled foundation that the original JEPA position paper envisioned but did not deliver.

12. References

  1. Balestriero, R. & LeCun, Y. (2025). LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. arXiv:2511.08544.
  2. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Technical report, Meta AI. openreview.net/pdf?id=BZ5a1r-kVsf.
  3. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023. arXiv:2301.08243.
  4. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA). arXiv:2404.08471.
  5. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. arXiv:2006.07733.
  6. Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022. arXiv:2105.04906.
  7. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021. arXiv:2103.03230.
  8. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021. arXiv:2104.14294.
  9. Tian, Y., Chen, X., & Ganguli, S. (2021). Understanding Self-Supervised Learning Dynamics without Contrastive Pairs. ICML 2021. arXiv:2102.06810.
  10. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022. arXiv:2111.06377.
  11. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929.
  12. Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., & Kong, T. (2022). iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR 2022. arXiv:2111.07832.