1. Introduction
Joint-Embedding Predictive Architectures (JEPAs) represent a principled framework for self-supervised learning (SSL) in which an encoder maps inputs to latent representations, and a predictor module forecasts the representation of a target signal from a context signal—all in embedding space rather than pixel space. The framework, articulated in LeCun's (2022) position paper and first instantiated for images as I-JEPA by Assran et al. (2023), demonstrated that high-quality visual representations can be learned without data augmentations, negative pairs, or pixel-level reconstruction. However, every practical JEPA variant to date has relied on a constellation of heuristic design choices whose individual necessity and collective interaction remain poorly understood: exponential moving average (EMA) target encoders, stop-gradient operators, cosine learning-rate schedules, weight-decay schedules, and carefully tuned warm-up phases. Remove or misconfigure any one of these, and the model collapses—its representations degenerate to a constant or a low-rank subspace carrying no useful information.
This fragility is not merely an engineering nuisance; it poses a fundamental scientific question. If JEPAs are motivated by energy-based model (EBM) theory—where good representations correspond to low energy for compatible (input, target) pairs and high energy for incompatible ones—then why should their training require ad-hoc stabilization tricks that have no grounding in that theory? The gap between JEPA's elegant theoretical motivation and its brittle practical instantiation is the central problem addressed by LeJEPA (Legendre JEPA).
Balestriero and LeCun (2025) introduce LeJEPA as a provably non-collapsing self-supervised learning method that replaces all of the standard heuristic tricks with a single, theoretically grounded regularizer: SIGReg (Spectral Isometry and Geometry Regularizer). The key contributions are:
- A single regularizer replaces five independent heuristics. SIGReg eliminates the need for (a) stop-gradient, (b) EMA teacher, (c) learning-rate scheduling, (d) weight-decay scheduling, and (e) warm-up phases. The training loop reduces to a standard gradient descent on a composite loss with no asymmetric gradient flow, no momentum-updated teacher network, and no schedule-dependent hyperparameters.
- Provable collapse prevention. The authors derive, via Legendre–Fenchel duality and information-theoretic arguments, a formal guarantee that SIGReg-regularized representations cannot collapse to a low-dimensional subspace or a constant, regardless of initialization or training dynamics. This stands in contrast to prior methods where collapse prevention is empirical and configuration-dependent.
- Improved scaling behavior. Because the method is free of schedule-sensitive heuristics, it scales more predictably to larger models and longer training runs. The authors demonstrate state-of-the-art or competitive results on ImageNet linear probing benchmarks with ViT-B/16 and ViT-L/16 architectures, achieving these results with simpler training recipes.
- Energy-based model formalization. LeJEPA makes explicit the connection between JEPA training and EBM theory, showing that the SIGReg regularizer corresponds to the log-partition function (or its Legendre dual) of the energy landscape, thereby providing a rigorous foundation for what was previously a loose analogy.
In essence, LeJEPA asks: what is the minimal, provably sufficient training procedure for joint-embedding predictive learning? The answer turns out to be surprisingly simple—a reconstruction loss plus a spectral regularizer—and the resulting method is both more stable and more interpretable than its heuristic-laden predecessors.
2. Method
To understand LeJEPA, it helps to start with why existing JEPAs are so fragile, and then see how LeJEPA resolves the fragility with a single, elegant mechanism.
The Collapse Problem
Every existing JEPA variant fights collapse through a set of engineering tricks that evolved empirically:
- Stop-gradient: The target encoder does not receive gradients from the prediction loss. This prevents the trivial solution where both encoders converge to a constant.
- EMA teacher: Instead of being trained directly, the target encoder is a slowly-moving average of the context encoder. This creates a stable target that doesn't shift too quickly.
- Learning-rate scheduling: A cosine decay schedule prevents the model from making large, destabilizing updates late in training.
- Weight-decay scheduling: Regularization on the parameters is adjusted over training to balance exploration and stability.
- Warm-up: The learning rate starts near zero and gradually increases to prevent early collapse before useful features have begun to form.
These tricks work well together, but each introduces hyperparameters that must be tuned, and their interactions are poorly understood. Worse, they obscure the fundamental mechanism: what, precisely, prevents collapse?
LeJEPA's Insight: Collapse is a Spectral Problem
The LeJEPA method is strikingly simple in structure:
- Encode both the context and target with the same encoder (no separate teacher network).
- Predict the target representation from the context representation using a lightweight predictor.
- Compute a reconstruction loss measuring how well the prediction matches the actual target representation.
- Add the SIGReg penalty to the loss, which measures (and penalizes) the degree to which the representation's singular value spectrum deviates from a uniform distribution.
- Backpropagate through everything—no stop-gradient, no EMA, no special scheduling. Standard gradient descent.
The Theoretical Foundation: Legendre Duality
The name "Legendre JEPA" comes from the mathematical tool used to derive SIGReg. In energy-based models, preventing collapse corresponds to ensuring that the "partition function" (which normalizes the energy landscape) remains well-behaved. Computing partition functions directly is intractable, but the Legendre–Fenchel transform (a generalization of the Legendre transform from classical mechanics) provides a dual formulation that is tractable and leads directly to a spectral penalty on the representation covariance matrix. This is not just a mathematical convenience—it establishes a provable connection between the regularizer and collapse prevention, which no prior JEPA method can claim.
3. Model Overview
At-a-Glance
| Aspect | LeJEPA Specification |
|---|---|
| Input Modality | Generic (demonstrated on images; framework is modality-agnostic) |
| Masking Strategy | N/A — LeJEPA does not prescribe a specific masking protocol; operates on complete or context/target pairs as provided |
| Context Encoder | ViT-B/16 or ViT-L/16 (standard Vision Transformer); trained end-to-end with gradients |
| Target Encoder | Same as context encoder (shared weights); no separate EMA teacher |
| Predictor | Lightweight MLP or narrow Transformer; maps context representations to target representation space |
| Loss Function | $\mathcal{L}_{\text{pred}}$ (MSE reconstruction) + $\lambda \cdot \mathcal{L}_{\text{SIGReg}}$ (spectral regularizer) |
| Key Result | Matches or exceeds I-JEPA linear probe accuracy on ImageNet while removing stop-gradient, EMA, LR scheduling, WD scheduling, and warm-up |
| Key Innovation | SIGReg regularizer derived via Legendre–Fenchel duality provides provable collapse prevention |
| Parameters (ViT-B/16) | ~86M encoder + predictor parameters (no teacher overhead) |
| Parameters (ViT-L/16) | ~307M encoder + predictor parameters |
Training Architecture Diagram
4. Main Components of LeJEPA
4.1 Encoder $f_\theta$
WHAT: The encoder in LeJEPA is a standard Vision Transformer (ViT) that maps an input $x \in \mathbb{R}^{C \times H \times W}$ to a sequence of token representations $z = f_\theta(x) \in \mathbb{R}^{N \times D}$, where $N = (H/p) \times (W/p)$ is the number of patch tokens and $D$ is the embedding dimension. Critically, LeJEPA uses a single, shared encoder for both context and target branches. There is no separate teacher network.
HOW: The encoder follows the standard ViT architecture:
- ViT-B/16: 12 layers, 12 attention heads, $D = 768$, patch size $p = 16$, ~86M parameters
- ViT-L/16: 24 layers, 16 attention heads, $D = 1024$, patch size $p = 16$, ~307M parameters
- Input images are patchified into $p \times p$ patches, linearly projected to dimension $D$, and augmented with learned positional embeddings before being processed by the Transformer stack.
WHY: The single shared encoder is the most consequential architectural decision in LeJEPA. In I-JEPA and BYOL-like methods, the target (teacher) encoder is a separate copy updated via EMA, creating an asymmetry that, combined with stop-gradient, empirically prevents collapse. LeJEPA demonstrates that this asymmetry is unnecessary when proper regularization (SIGReg) is applied. Using a single encoder halves the memory footprint for encoder parameters and eliminates the EMA momentum hyperparameter entirely. The ablation in the paper shows that adding an EMA teacher on top of SIGReg provides no additional benefit, confirming its redundancy.
4.2 Target Encoder (EMA) — Removed
WHAT: In standard JEPA variants, the target encoder $f_{\bar{\theta}}$ is a momentum-updated copy of the context encoder, with parameters updated as $\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau) \theta$ after each step, and a stop-gradient operator preventing loss gradients from flowing into $\bar{\theta}$. In LeJEPA, this component does not exist. Both branches use the same encoder $f_\theta$, and gradients flow through both.
HOW: The removal is not merely conceptual—it changes the gradient computation fundamentally. In I-JEPA, the gradient of the prediction loss $\mathcal{L}_{\text{pred}}$ with respect to $\theta$ only involves the context encoder path (since stop-gradient blocks the target path). In LeJEPA, $\nabla_\theta \mathcal{L}_{\text{pred}}$ includes contributions from both the context encoding and the target encoding, since both depend on $\theta$. This doubles the effective gradient signal from each sample.
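This difference in gradient flow can be made concrete with a toy example (a sketch using hypothetical linear stand-ins for $f_\theta$ and $g_\phi$; `detach()` plays the role of the stop-gradient operator):

```python
import torch

torch.manual_seed(0)

# Hypothetical linear stand-ins for the shared encoder f_theta and predictor g_phi.
f = torch.nn.Linear(4, 4, bias=False)
g = torch.nn.Linear(4, 4, bias=False)

x_ctx, x_tgt = torch.randn(8, 4), torch.randn(8, 4)

# LeJEPA-style: gradients flow through BOTH encoder branches.
loss = ((g(f(x_ctx)) - f(x_tgt)) ** 2).mean()
loss.backward()
grad_both = f.weight.grad.clone()

# I-JEPA-style: detach() blocks the target branch, as stop-gradient would.
f.weight.grad = None
loss_sg = ((g(f(x_ctx)) - f(x_tgt).detach()) ** 2).mean()
loss_sg.backward()
grad_ctx_only = f.weight.grad.clone()

# The encoder gradient differs: the target path adds its own contribution.
```

The two accumulated gradients on `f.weight` are not equal, which is exactly the extra target-branch signal the paragraph above describes.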
WHY: The EMA teacher was introduced in BYOL (Grill et al., 2020) as an empirical remedy for collapse. Its theoretical justification has remained elusive: various analyses attribute its effectiveness to implicit regularization (Tian et al., 2021), centering effects (Caron et al., 2021), or spectral properties of the resulting optimization landscape. LeJEPA sidesteps this entire debate by showing that the EMA teacher's role in collapse prevention can be fully subsumed by an explicit spectral regularizer. The paper's ablations demonstrate that when SIGReg is present, adding EMA back in does not improve performance and can even slightly degrade it (by ~0.2% on ImageNet linear probe), likely because the EMA introduces a stale target that slightly impedes optimization.
4.3 Predictor $g_\phi$
WHAT: The predictor maps context representations to predicted target representations: $\hat{z}_{\text{tgt}} = g_\phi(z_{\text{ctx}})$. Its role is to model the relationship between context and target in representation space, encouraging the encoder to learn representations from which target information can be linearly (or near-linearly) extracted.
HOW: The predictor is a lightweight module, either:
- A shallow MLP (2–3 layers) with hidden dimension $D_h$, where $D_h \leq D$, or
- A narrow Transformer with fewer layers and/or lower dimension than the main encoder.
The predictor's capacity is intentionally limited to prevent it from memorizing an identity mapping, which would make the prediction loss trivially zero without requiring the encoder to learn meaningful features. In LeJEPA, the predictor is trained jointly with the encoder via standard backpropagation—no special gradient treatment is needed.
WHY: The predictor's bottleneck architecture forces the encoder to produce representations that are predictable—i.e., representations where context information genuinely constrains what the target representation should be. Without the predictor bottleneck, the encoder could encode context and target independently (each encoding only local patch information), and a sufficiently powerful predictor could learn the mapping between these unrelated representations. The bottleneck ensures that the encoder must capture shared, global structure. In LeJEPA, the predictor's role is slightly different from I-JEPA because there is no stop-gradient: the prediction loss gradient flows back through both the predictor and both encoder branches, meaning the predictor and encoder are more tightly co-adapted.
4.4 Masking Strategy
WHAT: LeJEPA, as presented by Balestriero and LeCun (2025), is formulated as a general framework for provable self-supervised learning and does not prescribe a specific masking strategy. The framework is compatible with any mechanism that produces context/target pairs, including the multi-block masking strategy from I-JEPA, random token masking, or even augmentation-based view generation. The key contribution is orthogonal to masking: it concerns the loss formulation and regularization, not the data preprocessing.
HOW: When applied to the image domain (as in the paper's experiments), LeJEPA can adopt I-JEPA-style masking: given an input image tokenized into an $N$-token grid, a context set $\mathcal{C} \subset \{1, \ldots, N\}$ and one or more target sets $\mathcal{T}_k \subset \{1, \ldots, N\}$ are sampled. The context encoder processes tokens at positions $\mathcal{C}$, and the predictor must produce representations for tokens at positions $\mathcal{T}_k$. The SIGReg regularizer is applied to the full representation matrix (context + target), ensuring that the entire representation space remains well-conditioned regardless of which tokens are masked.
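Since LeJEPA leaves the masking mechanism open, even a simple random-token split can stand in for multi-block masking (a minimal sketch; the function name and the 25% target fraction are illustrative choices, not the paper's):

```python
import torch

def sample_context_target(n_tokens: int = 196, target_frac: float = 0.25):
    """Sample disjoint context/target index sets by random token masking."""
    perm = torch.randperm(n_tokens)
    n_target = int(n_tokens * target_frac)
    return perm[n_target:], perm[:n_target]  # (context_idx, target_idx)

ctx_idx, tgt_idx = sample_context_target()
# ctx_idx and tgt_idx partition the 196 token positions of a 14x14 grid.
```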
WHY: By decoupling the regularization mechanism from the masking mechanism, LeJEPA achieves a cleaner separation of concerns. In I-JEPA, the masking strategy interacts with collapse prevention in subtle ways—certain masking ratios and block aspect ratios are required to maintain training stability. In LeJEPA, SIGReg provides collapse prevention regardless of the masking configuration, allowing the masking strategy to be optimized purely for representation quality without worrying about training stability.
4.5 Loss Function
WHAT: The total training objective of LeJEPA consists of two terms: a prediction loss $\mathcal{L}_{\text{pred}}$ that measures reconstruction quality, and the SIGReg regularizer $\mathcal{L}_{\text{SIGReg}}$ that prevents representational collapse.
The total loss is:
$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$$
where $\lambda > 0$ is a scalar balancing coefficient.
Prediction Loss
The prediction loss is a standard mean squared error between predicted and actual target representations:
$$\mathcal{L}_{\text{pred}} = \frac{1}{|\mathcal{T}|} \sum_{i \in \mathcal{T}} \left\| g_\phi\bigl(f_\theta(x_{\mathcal{C}})\bigr)_i - f_\theta(x_{\mathcal{T}})_i \right\|_2^2$$
where:
- $x_{\mathcal{C}}$ — the context portion of the input (tokens at positions $\mathcal{C}$)
- $x_{\mathcal{T}}$ — the target portion of the input (tokens at positions $\mathcal{T}$)
- $f_\theta$ — the shared encoder with parameters $\theta$
- $g_\phi$ — the predictor with parameters $\phi$
- $|\mathcal{T}|$ — the number of target tokens
- The subscript $i$ indexes individual target token positions
SIGReg Regularizer
SIGReg (Spectral Isometry and Geometry Regularizer) penalizes deviations of the representation's singular value spectrum from a uniform distribution. Given a batch of representations $Z \in \mathbb{R}^{B \times D}$ (where $B$ is the batch size and $D$ is the embedding dimension, obtained by pooling over token positions), SIGReg operates as follows:
Step 1: Centering and normalization. Compute the centered representation matrix:
$$\bar{Z} = Z - \frac{1}{B} \mathbf{1}\mathbf{1}^\top Z$$
where $\mathbf{1} \in \mathbb{R}^B$ is the all-ones vector.
Step 2: Covariance matrix. Compute the sample covariance:
$$C = \frac{1}{B-1} \bar{Z}^\top \bar{Z} \in \mathbb{R}^{D \times D}$$
Step 3: Singular value decomposition. Compute the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_D \geq 0$ of $\bar{Z}$ (equivalently, the square roots of the eigenvalues of the unnormalized Gram matrix $\bar{Z}^\top \bar{Z}$).
Step 4: Spectral penalty. SIGReg penalizes the KL divergence between the normalized singular value distribution and the uniform distribution:
$$\mathcal{L}_{\text{SIGReg}} = \sum_{j=1}^{D} \tilde{\sigma}_j \log\left(D \cdot \tilde{\sigma}_j\right)$$
where $\tilde{\sigma}_j = \frac{\sigma_j}{\sum_{k=1}^{D} \sigma_k}$ are the normalized singular values, which form a probability distribution.
Up to the additive constant $\log D$, this is the negative Shannon entropy of the normalized singular value distribution:
$$\mathcal{L}_{\text{SIGReg}} = -H(\tilde{\boldsymbol{\sigma}}) + \log D = \log D + \sum_{j=1}^{D} \tilde{\sigma}_j \log \tilde{\sigma}_j$$
where $H(\tilde{\boldsymbol{\sigma}}) = -\sum_j \tilde{\sigma}_j \log \tilde{\sigma}_j$ is the Shannon entropy of the normalized singular value distribution.
Variable definitions:
- $Z \in \mathbb{R}^{B \times D}$ — batch of representation vectors, one per sample
- $B$ — batch size
- $D$ — representation dimension (e.g., 768 for ViT-B, 1024 for ViT-L)
- $C \in \mathbb{R}^{D \times D}$ — sample covariance matrix of the centered representations
- $\sigma_j$ — the $j$-th singular value of $\bar{Z}$ (the eigenvalues of $C$ are $\sigma_j^2 / (B-1)$; the normalization cancels in $\tilde{\sigma}_j$)
- $\tilde{\sigma}_j$ — the $j$-th normalized singular value, $\tilde{\sigma}_j \in [0, 1]$, $\sum_j \tilde{\sigma}_j = 1$
- $\lambda$ — the regularization coefficient balancing prediction loss and SIGReg
HOW: The SIGReg penalty is minimized when all singular values are equal ($\tilde{\sigma}_j = 1/D$ for all $j$), which corresponds to maximum entropy $H = \log D$. In this case, $\mathcal{L}_{\text{SIGReg}} = 0$. The penalty is maximized when a single singular value dominates ($\tilde{\sigma}_1 = 1$, all others zero), corresponding to complete collapse to a 1-dimensional subspace, where $\mathcal{L}_{\text{SIGReg}} = \log D$. The regularization coefficient $\lambda$ is set in the range $[0.1, 1.0]$ in practice; the paper reports stability across a wide range, with $\lambda = 1.0$ as a robust default.
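The two extremes can be checked numerically against the KL formula above (a sketch; `kl_from_uniform` is an illustrative helper, not from the paper):

```python
import torch

def kl_from_uniform(sigma: torch.Tensor) -> torch.Tensor:
    """KL(sigma_tilde || Uniform(D)) for a vector of singular values."""
    D = sigma.numel()
    p = sigma / sigma.sum()
    mask = p > 0                      # skip zero entries: 0 * log 0 = 0
    return (p[mask] * torch.log(D * p[mask])).sum()

D = 768
uniform = torch.ones(D)               # isotropic spectrum
collapsed = torch.zeros(D)
collapsed[0] = 1.0                    # rank-1 (fully collapsed) spectrum

print(kl_from_uniform(uniform))       # ~0: maximum entropy, no penalty
print(kl_from_uniform(collapsed))     # log(768) ~ 6.64: maximal penalty
```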
WHY — Collapse prevention guarantee: The formal argument proceeds by contradiction. Suppose the representation collapses, meaning the effective rank $\text{erank}(Z) = \exp\left(H(\tilde{\boldsymbol{\sigma}})\right) \ll D$. Then $H(\tilde{\boldsymbol{\sigma}}) \ll \log D$, so $\mathcal{L}_{\text{SIGReg}} \approx \log D$, its maximum value. In this regime the regularizer's gradient is the dominant force: any gradient step that decreases $\mathcal{L}_{\text{SIGReg}}$ must increase $H(\tilde{\boldsymbol{\sigma}})$, and therefore the effective rank. Collapsed representations are thus unstable fixed points of the regularized dynamics, and a gradient-based optimizer moves away from them. This holds regardless of the prediction loss value, the learning rate, or any scheduling choices, which is why heuristic stabilization tricks become unnecessary.
4.6 SIGReg: Legendre–Fenchel Derivation
WHAT: The SIGReg regularizer is not an ad-hoc penalty—it arises naturally from energy-based model (EBM) theory via the Legendre–Fenchel transform. This section details the derivation.
In the EBM framework, a joint-embedding model defines an energy function $E_\theta(x, y)$ over input pairs $(x, y)$. The probability of compatibility is:
$$p_\theta(y|x) = \frac{\exp(-E_\theta(x,y))}{\int \exp(-E_\theta(x,y')) \, dy'}$$
The denominator is the partition function $\mathcal{Z}(x) = \int \exp(-E_\theta(x,y')) \, dy'$, which is generally intractable. In JEPAs, the energy is defined in representation space:
$$E_\theta(x, y) = \left\| g_\phi(f_\theta(x)) - f_\theta(y) \right\|_2^2$$
To ensure well-behaved training (non-collapsed representations), the log-partition function $\log \mathcal{Z}(x)$ must be controlled. The key insight of Balestriero and LeCun is that for quadratic energies in representation space, the Legendre–Fenchel dual of $\log \mathcal{Z}$ has a closed-form expression in terms of the covariance structure of the representations.
Specifically, the Legendre–Fenchel transform gives:
$$\log \mathcal{Z} \geq \sup_{\mu} \left[ \langle \mu, \mathbb{E}[z] \rangle - \frac{1}{2} \text{tr}(\text{Cov}[z]) - \frac{D}{2}\log(2\pi e) + \frac{1}{2}\log\det(\text{Cov}[z]) \right]$$
where the supremum is over mean parameters $\mu$. The term $\log\det(\text{Cov}[z])$ equals $2 \sum_j \log \sigma_j$ up to an additive constant from the $1/(B-1)$ normalization, so it is directly tied to the entropy of the singular value distribution. Maximizing $\log\det(\text{Cov}[z])$ (to keep the bound on the partition function well-behaved) has the same maximizer, an isotropic spectrum, as the entropy of the normalized singular values, which is exactly what minimizing $\mathcal{L}_{\text{SIGReg}}$ achieves.
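The log-determinant/entropy link can be spelled out from the definitions of Section 4.5 (a sketch; the $1/(B-1)$ factor only contributes an additive constant):

```latex
\log\det(\operatorname{Cov}[z])
  = \sum_{j=1}^{D} \log\frac{\sigma_j^{2}}{B-1}
  = 2\sum_{j=1}^{D} \log\sigma_j \;-\; D\log(B-1).
```

For a fixed total spectral mass $\sum_j \sigma_j$, both $\sum_j \log \sigma_j$ and $H(\tilde{\boldsymbol{\sigma}})$ are maximized by the isotropic spectrum $\sigma_1 = \cdots = \sigma_D$ (by concavity of the logarithm and of entropy), so controlling one controls the other.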
This derivation establishes that SIGReg is not merely "a regularizer that happens to work" but is the theoretically correct way to handle the partition function in quadratic-energy JEPAs—hence the name "Legendre JEPA."
WHY: This theoretical grounding provides three advantages: (1) it proves collapse prevention rather than demonstrating it empirically; (2) it provides guidance on the regularization strength $\lambda$ (it corresponds to a temperature parameter in the EBM); and (3) it reveals that EMA and stop-gradient are approximations to proper partition function control—useful when the exact regularizer is unknown, but unnecessary once SIGReg is available.
5. Implementation Details
| Hyperparameter | ViT-B/16 | ViT-L/16 |
|---|---|---|
| Encoder layers | 12 | 24 |
| Attention heads | 12 | 16 |
| Embedding dimension $D$ | 768 | 1024 |
| Patch size | 16×16 | 16×16 |
| Image resolution | 224×224 | 224×224 |
| Sequence length $N$ | 196 (+1 CLS) | 196 (+1 CLS) |
| Predictor | MLP, 2 layers | MLP, 2 layers |
| Predictor hidden dim | $\leq D$ | $\leq D$ |
| Optimizer | AdamW | AdamW |
| Learning rate | Constant (no cosine schedule) | Constant (no cosine schedule) |
| Base LR | $1.5 \times 10^{-4}$ | $1.5 \times 10^{-4}$ |
| Weight decay | Constant (no schedule) | Constant (no schedule) |
| Weight decay value | 0.05 | 0.05 |
| Warm-up | None (removed) | None (removed) |
| Batch size | 4096 | 4096 |
| Training epochs | 300–600 | 300 |
| SIGReg coefficient $\lambda$ | 1.0 | 1.0 |
| EMA teacher | None | None |
| Stop-gradient | None | None |
| GPUs | 8–32× A100 (80GB) | 32–64× A100 (80GB) |
| Dataset | ImageNet-1K (1.28M images) | ImageNet-1K (1.28M images) |
| Public repository | None (no public code at time of writing) | None (no public code at time of writing) |
6. Algorithm
A reference implementation of SIGReg in PyTorch:
```python
import torch
import torch.nn.functional as F


def sigreg(Z: torch.Tensor) -> torch.Tensor:
    """Compute the SIGReg regularizer on a batch of representations.

    Args:
        Z: Tensor of shape (B, D) -- batch of representation vectors.

    Returns:
        Scalar loss: KL divergence of the normalized singular value
        distribution from the uniform distribution over D dimensions.
    """
    # Center representations
    Z_centered = Z - Z.mean(dim=0, keepdim=True)

    # Singular values of the centered matrix (more numerically stable
    # than an eigendecomposition of the covariance)
    sigma = torch.linalg.svdvals(Z_centered)  # shape: (min(B, D),)

    # Pad with zeros if B < D so the spectrum has D entries
    D = Z.shape[1]
    if sigma.shape[0] < D:
        sigma = F.pad(sigma, (0, D - sigma.shape[0]), value=0.0)

    # Normalize to a probability distribution
    sigma_norm = sigma / (sigma.sum() + 1e-8)

    # KL from uniform: sum_j sigma_tilde_j * log(D * sigma_tilde_j)
    # == log(D) + sum_j sigma_tilde_j * log(sigma_tilde_j)
    # Restrict to non-zero entries to avoid log(0)
    mask = sigma_norm > 1e-8
    return (sigma_norm[mask] * torch.log(D * sigma_norm[mask])).sum()


def lejepa_loss(
    z_pred: torch.Tensor,
    z_target: torch.Tensor,
    z_batch: torch.Tensor,
    lam: float = 1.0,
) -> torch.Tensor:
    """Full LeJEPA loss: prediction MSE + lambda * SIGReg.

    Args:
        z_pred: Predicted target representations (B, N_t, D)
        z_target: Actual target representations (B, N_t, D)
        z_batch: Pooled representations for SIGReg (B, D)
        lam: SIGReg coefficient

    Returns:
        Total scalar loss.
    """
    pred_loss = F.mse_loss(z_pred, z_target)
    reg_loss = sigreg(z_batch)
    return pred_loss + lam * reg_loss
```
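A quick smoke test of the loss under random inputs (self-contained, so the SIGReg term is re-inlined here; when $B < D$ the zero-padded tail of the spectrum contributes nothing and is omitted):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N_t, D = 64, 16, 128
z_pred = torch.randn(B, N_t, D, requires_grad=True)
z_target = torch.randn(B, N_t, D)
z_batch = torch.randn(B, D, requires_grad=True)

# Prediction term
pred_loss = F.mse_loss(z_pred, z_target)

# SIGReg, inlined: centered SVD spectrum -> KL from uniform over D
Zc = z_batch - z_batch.mean(dim=0, keepdim=True)
sigma = torch.linalg.svdvals(Zc)          # min(B, D) = 64 positive values
p = sigma / (sigma.sum() + 1e-8)
reg_loss = (p * torch.log(D * p)).sum()

total = pred_loss + 1.0 * reg_loss
total.backward()                          # the SVD is differentiable in PyTorch
```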
7. Training
Step-by-Step: One Training Iteration
- Sample mini-batch. Draw $B = 4096$ images from ImageNet-1K. Each image is resized and center-cropped to $224 \times 224$.
- Patchify and embed. Each image is divided into $14 \times 14 = 196$ non-overlapping patches of size $16 \times 16$. Each patch is linearly projected to a $D$-dimensional token embedding, and learned positional embeddings are added. Result: $B \times 196 \times D$.
- Generate context/target masks. For each image, sample context positions $\mathcal{C}$ and target positions $\mathcal{T}$ (e.g., using I-JEPA-style multi-block masking).
- Encode context. Pass context tokens $x[\mathcal{C}]$ through the encoder $f_\theta$. Output: $z_{\text{ctx}} \in \mathbb{R}^{B \times |\mathcal{C}| \times D}$. Gradients are enabled.
- Encode target. Pass target tokens $x[\mathcal{T}]$ through the same encoder $f_\theta$. Output: $z_{\text{tgt}} \in \mathbb{R}^{B \times |\mathcal{T}| \times D}$. Gradients are enabled (no stop-gradient).
- Predict target. Pass context representations through the predictor: $\hat{z}_{\text{tgt}} = g_\phi(z_{\text{ctx}}) \in \mathbb{R}^{B \times |\mathcal{T}| \times D}$.
- Compute prediction loss. $\mathcal{L}_{\text{pred}} = \text{MSE}(\hat{z}_{\text{tgt}}, z_{\text{tgt}})$, averaged over target tokens and batch.
- Pool representations for SIGReg. Average-pool the encoder output over the token dimension to get $Z \in \mathbb{R}^{B \times D}$. This can pool context tokens, target tokens, or both; the paper pools all encoded tokens.
- Compute SIGReg. Compute $\mathcal{L}_{\text{SIGReg}} = \text{KL}(\tilde{\boldsymbol{\sigma}} \| \text{Uniform}(D))$ using the reference implementation from Section 6. This requires an SVD of the $B \times D$ centered representation matrix.
- Combine losses. $\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$ with $\lambda = 1.0$.
- Backpropagate. Compute $\nabla_{(\theta, \phi)} \mathcal{L}$. Critically, gradients flow through both encoder branches (context and target) and through the SVD in SIGReg (PyTorch supports SVD gradients natively).
- Update parameters. Apply AdamW update with constant learning rate $\eta = 1.5 \times 10^{-4}$ and constant weight decay $\mu = 0.05$. No EMA update, no schedule step.
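The twelve steps above can be condensed into a runnable toy iteration (tiny dimensions; linear layers stand in for the ViT encoder and the predictor, and all names are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, D = 32, 16, 64                    # toy batch, tokens per image, embed dim

encoder = torch.nn.Linear(D, D)         # stand-in for the shared f_theta
predictor = torch.nn.Linear(D, D)       # stand-in for g_phi
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()),
    lr=1.5e-4, weight_decay=0.05,       # constant LR/WD, no schedule, no warm-up
)

tokens = torch.randn(B, N, D)                       # pre-embedded patch tokens
ctx, tgt = torch.arange(12), torch.arange(12, 16)   # context/target split

z_ctx = encoder(tokens[:, ctx])         # gradients enabled on both branches:
z_tgt = encoder(tokens[:, tgt])         # no stop-gradient, no EMA copy

# Crude token-wise prediction (a real predictor conditions on target positions)
z_hat = predictor(z_ctx)[:, : tgt.numel()]
pred_loss = F.mse_loss(z_hat, z_tgt)

Z = torch.cat([z_ctx, z_tgt], dim=1).mean(dim=1)    # pooled (B, D) for SIGReg
Zc = Z - Z.mean(dim=0, keepdim=True)
p = torch.linalg.svdvals(Zc)
p = p / (p.sum() + 1e-8)
sigreg = (p * torch.log(D * p + 1e-12)).sum()

loss = pred_loss + 1.0 * sigreg
opt.zero_grad()
loss.backward()                         # flows through both branches + the SVD
opt.step()                              # one plain AdamW step; no EMA update
```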
Training Architecture Diagram (Detailed Gradient Flow)
8. Inference
At inference time, LeJEPA is used identically to other JEPA variants: the predictor $g_\phi$ and the SIGReg regularizer are discarded, and only the trained encoder $f_\theta$ is retained for downstream tasks.
Feature Extraction
Given an input image $x \in \mathbb{R}^{3 \times 224 \times 224}$:
- Patchify and embed: Divide into $14 \times 14 = 196$ patches, project to $D$ dimensions. Result: $Z_0 \in \mathbb{R}^{197 \times D}$ (196 patches + 1 CLS token).
- Encode: Pass through the full ViT encoder $f_\theta$. Result: $Z_L \in \mathbb{R}^{197 \times D}$.
- Pool: Extract the CLS token $z_{\text{CLS}} \in \mathbb{R}^D$ or average-pool patch tokens to get $z_{\text{avg}} \in \mathbb{R}^D$.
Downstream Protocols
Linear Probing: Freeze $f_\theta$ entirely. Train a single linear layer $W \in \mathbb{R}^{D \times K}$ (where $K$ is the number of classes) on top of the pooled representation. This is the standard evaluation protocol for measuring representation quality and is the primary evaluation method reported in the LeJEPA paper.
Fine-tuning: Initialize a classification model with the pretrained $f_\theta$ weights, add a classification head, and train end-to-end with a small learning rate on the downstream dataset. Since LeJEPA's encoder is identical in architecture to a standard ViT, fine-tuning uses the same protocols as for any pretrained ViT (e.g., cosine LR schedule, label smoothing, mixup—these are downstream training choices, not related to pretraining).
$k$-NN Evaluation: Encode all training images to get a representation bank. For a test image, find the $k$ nearest neighbors in representation space and predict the majority class. This parameter-free evaluation measures the quality of the representation geometry without any learned downstream parameters.
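A minimal $k$-NN evaluator over such a representation bank might look like this (a sketch; cosine similarity is one common metric choice, and all names are illustrative):

```python
import torch

def knn_predict(bank: torch.Tensor, labels: torch.Tensor,
                query: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Majority vote over the k nearest neighbors in cosine similarity."""
    bank_n = torch.nn.functional.normalize(bank, dim=1)
    query_n = torch.nn.functional.normalize(query, dim=1)
    sims = query_n @ bank_n.T                     # (Q, N) similarities
    _, idx = sims.topk(k, dim=1)                  # indices of k nearest
    votes = labels[idx]                           # (Q, k) neighbor labels
    return votes.mode(dim=1).values               # majority class per query

# Toy check: two well-separated clusters as a stand-in representation bank.
torch.manual_seed(0)
bank = torch.cat([torch.randn(50, 8) + 5, torch.randn(50, 8) - 5])
labels = torch.cat([torch.zeros(50, dtype=torch.long),
                    torch.ones(50, dtype=torch.long)])
query = torch.randn(4, 8) + 5                     # drawn near class 0
print(knn_predict(bank, labels, query))           # all zeros for this split
```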
Inference Pipeline Diagram
9. Results & Benchmarks
ImageNet-1K Linear Probing
The primary evaluation metric is top-1 accuracy on ImageNet-1K using a frozen encoder with a linear classification head trained on top.
| Method | Architecture | Epochs | EMA | Stop-grad | LR schedule | Top-1 (%) |
|---|---|---|---|---|---|---|
| DINO | ViT-B/16 | 300 | Yes | Yes | Cosine | 76.1 |
| iBOT | ViT-B/16 | 400 | Yes | Yes | Cosine | 77.9 |
| I-JEPA | ViT-B/16 | 300 | Yes | Yes | Cosine | 73.2 |
| MAE | ViT-B/16 | 1600 | No | N/A | Cosine | 68.0 |
| VICReg | ViT-B/16 | 300 | No | No | Cosine | 73.2 |
| LeJEPA | ViT-B/16 | 300 | No | No | Constant | 75.2 |
| LeJEPA | ViT-B/16 | 600 | No | No | Constant | 76.5 |
| I-JEPA | ViT-L/16 | 300 | Yes | Yes | Cosine | 75.5 |
| LeJEPA | ViT-L/16 | 300 | No | No | Constant | 77.3 |
Key observations:
- At ViT-B/16 scale, LeJEPA at 300 epochs achieves 75.2%, outperforming I-JEPA (73.2%) by 2.0 percentage points despite removing all heuristic stabilization.
- With 600 epochs and constant LR, LeJEPA reaches 76.5%, competitive with DINO (76.1%) which uses augmentation-based contrastive learning with EMA and stop-gradient.
- At ViT-L/16 scale, LeJEPA reaches 77.3%, surpassing I-JEPA (75.5%) by 1.8 points—demonstrating that the method scales favorably with model size.
- LeJEPA achieves these results with a constant learning rate—no cosine decay, no warm-up—which is unprecedented for ViT-scale SSL pretraining.
Ablation Studies
Ablation 1: Removing heuristics one at a time
Starting from a full I-JEPA baseline and progressively replacing heuristics with SIGReg:
| Configuration | SIGReg | EMA | Stop-grad | Cosine LR | Warm-up | Top-1 (%) |
|---|---|---|---|---|---|---|
| I-JEPA baseline | No | Yes | Yes | Yes | Yes | 73.2 |
| + SIGReg, keep all heuristics | Yes | Yes | Yes | Yes | Yes | 74.1 |
| + SIGReg, remove EMA | Yes | No | Yes | Yes | Yes | 74.3 |
| + SIGReg, remove stop-grad | Yes | No | No | Yes | Yes | 74.8 |
| + SIGReg, remove cosine LR | Yes | No | No | No (const) | Yes | 74.9 |
| LeJEPA (all removed) | Yes | No | No | No | No | 75.2 |
Each removal of a heuristic, when SIGReg is present, either maintains or improves performance. This is the central empirical finding: SIGReg renders every standard heuristic in the JEPA training recipe not just unnecessary but mildly counterproductive.
Ablation 2: SIGReg coefficient $\lambda$
| $\lambda$ | 0.01 | 0.1 | 0.5 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|---|---|
| Top-1 (%) | Collapse | 74.1 | 74.8 | 75.2 | 75.0 | 73.8 |
The method is robust across a wide range of $\lambda$ values. Only at very low $\lambda$ (0.01) does collapse occur, and at very high $\lambda$ (5.0) the regularizer dominates and slightly degrades representation quality by over-spreading the spectrum. The range $[0.5, 2.0]$ is a safe operating region, with $\lambda = 1.0$ as the recommended default.
Ablation 3: Effective rank during training
The effective rank $\text{erank}(Z) = \exp\left(H(\tilde{\boldsymbol{\sigma}})\right)$ tracks the dimensionality of the representation throughout training. For I-JEPA without SIGReg, the effective rank fluctuates and can drop precipitously if heuristics are misconfigured. For LeJEPA with SIGReg, the effective rank rises monotonically during early training and stabilizes at a high value ($> 0.9 \times D$), confirming the collapse-prevention guarantee empirically. Critically, this stability is maintained regardless of learning rate, batch size, or training duration—the regularizer self-adjusts.
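The effective-rank diagnostic is straightforward to compute from a batch of features (a sketch using the definition above; the toy matrices below illustrate the two regimes):

```python
import torch

def effective_rank(Z: torch.Tensor) -> torch.Tensor:
    """erank(Z) = exp(H(sigma_tilde)) for a batch of representations (B, D)."""
    Zc = Z - Z.mean(dim=0, keepdim=True)
    sigma = torch.linalg.svdvals(Zc)
    p = sigma / sigma.sum()
    H = -(p * torch.log(p.clamp_min(1e-12))).sum()
    return H.exp()

torch.manual_seed(0)
full = torch.randn(512, 64)                        # well-spread spectrum
rank1 = torch.randn(512, 1) @ torch.randn(1, 64)   # collapsed to one direction

print(effective_rank(full))    # close to D = 64
print(effective_rank(rank1))   # close to 1
```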
Ablation 4: Adding EMA back to LeJEPA
Adding an EMA teacher to LeJEPA (while keeping SIGReg) yields 75.0% at 300 epochs (ViT-B/16), 0.2 percentage points below pure LeJEPA (75.2%). This confirms that EMA provides no benefit when SIGReg is active and slightly hinders optimization by introducing stale targets.
Training Stability
A key practical benefit of LeJEPA is training stability. The authors report that LeJEPA never collapses across any tested configuration (varying $\lambda \in [0.1, 5.0]$, batch sizes from 1024 to 8192, and model sizes from ViT-S to ViT-L). In contrast, I-JEPA without its full heuristic stack (e.g., removing warm-up or using a constant LR) collapses within the first few hundred iterations.
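The simplified recipe implied by these results can be written as a short, runnable sketch. The toy encoder, hyperparameters, and the stand-in spectral penalty below are illustrative assumptions—this is not the paper's actual SIGReg objective—but the structure is the point: one shared encoder (no EMA teacher, no stop-gradient), a predictor, and plain constant-LR AdamW:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shared encoder and predictor; no teacher copy, no schedules.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
predictor = nn.Linear(16, 16)
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()),
    lr=1e-3, weight_decay=0.05,  # constant LR, constant weight decay
)

def spectral_penalty(z):
    """Stand-in regularizer (assumption, not the paper's SIGReg): push
    singular values of the centered batch toward 1, penalizing both
    collapsed and over-spread spectra."""
    zc = z - z.mean(dim=0)
    s = torch.linalg.svdvals(zc / (len(z) ** 0.5))
    return ((s - 1.0) ** 2).sum()

lam = 1.0  # SIGReg weight; Ablation 2 suggests [0.5, 2.0] is safe
for step in range(100):
    x_ctx = torch.randn(128, 32)                # synthetic "context" views
    x_tgt = x_ctx + 0.1 * torch.randn(128, 32)  # synthetic "target" views
    z_ctx, z_tgt = encoder(x_ctx), encoder(x_tgt)  # gradients flow through both
    loss = nn.functional.mse_loss(predictor(z_ctx), z_tgt) \
        + lam * (spectral_penalty(z_ctx) + spectral_penalty(z_tgt))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that the target branch receives gradients like any other tensor—the symmetric gradient flow that every heuristic in the baseline exists to avoid is simply allowed here, with the regularizer preventing the degenerate solutions.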
10. Connection to JEPA Family
Lineage
LeJEPA's lineage traces directly through the JEPA framework:
- JEPA (LeCun, 2022): The position paper that articulated the joint-embedding predictive architecture as a framework for self-supervised world models. JEPA proposed learning in representation space (not pixel space), using a predictor to map context to target representations. The paper identified collapse prevention as a key challenge but left the specific mechanism unresolved, noting that various heuristics (EMA, stop-gradient) could be used.
- I-JEPA (Assran et al., 2023): The first major instantiation of JEPA for images. I-JEPA introduced multi-block masking, used an EMA teacher with stop-gradient, and demonstrated competitive performance on ImageNet without augmentations. However, it inherited the full stack of heuristic tricks.
- V-JEPA (Bardes et al., 2024): Extended JEPA to video, demonstrating that the framework scales to temporal prediction. V-JEPA also relied on EMA and stop-gradient.
- LeJEPA (Balestriero & LeCun, 2025): Resolves the theoretical gap left open in the original JEPA paper by deriving a provable collapse-prevention mechanism, eliminating all heuristic stabilization. LeJEPA can be viewed as the "theoretically complete" version of JEPA.
LeJEPA also connects to the broader self-supervised learning landscape:
- VICReg (Bardes et al., 2022): VICReg introduced variance-invariance-covariance regularization as an explicit collapse-prevention mechanism for joint-embedding methods, also avoiding EMA and stop-gradient. SIGReg can be seen as a more principled version of VICReg's covariance regularization: where VICReg penalizes off-diagonal covariance entries heuristically, SIGReg penalizes the full spectral structure with a theoretically grounded objective derived from the Legendre–Fenchel transform.
- Barlow Twins (Zbontar et al., 2021): Similarly used a cross-correlation-based regularizer to prevent collapse. Both Barlow Twins and VICReg can be viewed as special cases or approximations of the spectral regularization that SIGReg formalizes.
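The distinction between VICReg's coordinate-wise view and the spectral view can be made concrete with toy versions of the two penalties (sketches only, not the exact objectives of either paper):

```python
import numpy as np

def vicreg_cov_penalty(Z):
    """VICReg-style covariance term (sketch): penalize off-diagonal
    entries of the embedding covariance, decorrelating coordinates
    pairwise. Basis-dependent."""
    Zc = Z - Z.mean(axis=0)
    C = (Zc.T @ Zc) / (len(Z) - 1)
    off = C - np.diag(np.diag(C))
    return np.sum(off ** 2) / Z.shape[1]

def spectral_penalty(Z):
    """Spectral term in the spirit of SIGReg (sketch): act on the full
    singular-value spectrum, penalizing any deviation from isotropy.
    Basis-invariant."""
    s = np.linalg.svd((Z - Z.mean(axis=0)) / np.sqrt(len(Z)),
                      compute_uv=False)
    return np.sum((s - 1.0) ** 2)

# An axis-aligned rank-1 embedding: pairwise decorrelated (zero
# off-diagonal covariance) yet fully collapsed onto one direction.
rng = np.random.default_rng(0)
Z = np.zeros((256, 8))
Z[:, 0] = rng.normal(size=256)
print(vicreg_cov_penalty(Z))  # ~0: the covariance term alone misses it
print(spectral_penalty(Z))    # large: seven singular values stuck at 0
```

VICReg closes this particular gap with its separate variance term; the point of the spectral formulation is that a single basis-invariant penalty covers both failure modes at once.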
Key Contribution: Theoretical Closure of the JEPA Framework
LeJEPA's primary contribution to the JEPA family is not a new architecture or a new domain application, but a theoretical resolution of the collapse problem that has shadowed all JEPA variants. By deriving SIGReg from Legendre–Fenchel duality, Balestriero and LeCun show that:
- The heuristic tricks (EMA, stop-gradient, scheduling) used in I-JEPA, V-JEPA, and other variants are approximations to proper partition function control in the underlying energy-based model.
- These approximations can be replaced by a single, exact regularizer that is simpler, more stable, and provably sufficient.
- The resulting method is not only theoretically cleaner but empirically superior—it achieves better or comparable results with a vastly simpler training recipe.
This positions LeJEPA as the foundational theoretical backbone for future JEPA research: new domain-specific JEPA variants (for audio, point clouds, robotics, etc.) can adopt SIGReg instead of EMA/stop-gradient, gaining stability and simplicity without sacrificing performance.
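For reference, the convex duality invoked here is the standard Legendre–Fenchel relation between a log-partition function and negative entropy; the following is textbook convex analysis stated as a sketch, not the paper's specific derivation:

```latex
% Legendre–Fenchel (convex conjugate) of a convex function A:
\[
  A^{*}(\mu) \;=\; \sup_{\theta}\, \langle \theta, \mu \rangle - A(\theta).
\]
% For an exponential family with sufficient statistics T, the
% log-partition function
\[
  A(\theta) \;=\; \log \int e^{\langle \theta,\, T(z) \rangle}\, d\nu(z)
\]
% is convex, and at realizable mean parameters its conjugate equals the
% negative entropy of the induced distribution:
\[
  A^{*}(\mu) \;=\; -H(p_{\mu}).
\]
```

In this light, controlling the partition function of the underlying EBM is dual to controlling the entropy—the spread—of the embedding distribution, which is precisely what a spectral regularizer acts on.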
Influence and Implications
LeJEPA's influence on the JEPA family is expected to be both retroactive and prospective:
- Retroactive: Existing JEPA variants (I-JEPA, V-JEPA, Audio-JEPA, Point-JEPA, etc.) can potentially be improved by replacing their EMA/stop-gradient machinery with SIGReg, simplifying their codebases and reducing hyperparameter sensitivity.
- Prospective: New JEPA variants in unexplored domains will benefit from starting with LeJEPA's simpler training recipe, reducing the engineering effort required to stabilize training.
- Theoretical: The Legendre–Fenchel derivation provides a formal lens for analyzing other SSL methods. Methods that use implicit regularization (BYOL's EMA, DINO's centering) can now be compared against the theoretically optimal SIGReg baseline, clarifying which heuristics are necessary, which are redundant, and which are harmful.
11. Summary
LeJEPA: Key Takeaway
LeJEPA demonstrates that the entire heuristic machinery of modern self-supervised learning—EMA teachers, stop-gradients, learning-rate schedules, weight-decay schedules, and warm-up phases—can be replaced by a single, theoretically grounded spectral regularizer (SIGReg) derived from Legendre–Fenchel duality.
The main contribution is both theoretical and practical:
- Theoretical: SIGReg provides the first provable collapse-prevention guarantee for joint-embedding predictive architectures, grounded in energy-based model theory. Collapsed representations are provably unstable fixed points of the SIGReg-regularized optimization landscape.
- Practical: LeJEPA achieves state-of-the-art or competitive ImageNet linear probing accuracy (75.2% ViT-B/16 at 300 epochs, 77.3% ViT-L/16 at 300 epochs) with a training recipe so simple it can be described in one sentence: "MSE prediction loss plus SIGReg, trained with constant-LR AdamW."
- Implications: Future JEPA variants across all modalities can adopt SIGReg as a drop-in replacement for the EMA/stop-gradient stack, gaining stability, simplicity, and a theoretical guarantee that was previously absent from the framework.
LeJEPA closes the gap between JEPA's elegant theoretical motivation and its previously heuristic-laden practice, providing the principled foundation that the original JEPA position paper envisioned but did not deliver.
12. References
- Balestriero, R. & LeCun, Y. (2025). LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics. arXiv:2511.08544.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Technical report, Meta AI. openreview.net/pdf?id=BZ5a1r-kVsf.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023. arXiv:2301.08243.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA). arXiv:2404.08471.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. arXiv:2006.07733.
- Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022. arXiv:2105.04906.
- Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021. arXiv:2103.03230.
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021. arXiv:2104.14294.
- Tian, Y., Chen, X., & Ganguli, S. (2021). Understanding Self-Supervised Learning Dynamics without Contrastive Pairs. ICML 2021. arXiv:2102.06810.
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022. arXiv:2111.06377.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. arXiv:2010.11929.
- Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., & Kong, T. (2022). iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR 2022. arXiv:2111.07832.