Joint-Embedding Predictive Architecture (JEPA)
1. Introduction
How do humans learn so efficiently about the world? An infant, with remarkably little explicit supervision, builds a rich internal model of physics, object permanence, and cause-and-effect — all within the first few months of life. Modern AI systems, by contrast, require billions of labeled examples, meticulously curated reward functions, or massive generative pretraining to achieve a fraction of this competence. The Joint-Embedding Predictive Architecture (JEPA), proposed by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence, offers a foundational blueprint for closing this gap — not by predicting raw sensory inputs, but by predicting abstract representations of the world in a learned latent space.
The central problem JEPA addresses is deceptively simple: how should a machine learn the structure of the world from observation alone, without requiring reconstruction of every irrelevant detail? Generative models (autoencoders, masked autoencoders, diffusion models) attempt to predict input data in pixel space or token space. This forces the model to waste capacity on modeling every texture gradient, noise pattern, and stochastic detail — information that is often irrelevant for understanding. Contrastive methods (SimCLR, MoCo, CLIP) sidestep pixel prediction but introduce their own pathologies: they require careful negative sampling, are susceptible to dimensional collapse, and their energy landscape only distinguishes "compatible" from "incompatible" pairs without modeling the relationships between compatible inputs. JEPA proposes a third path: predict the representation of one part of the input from the representation of another part, entirely within a learned latent space.
As the foundational architecture of the JEPA family, this work does not derive from any predecessor — it is the predecessor. LeCun's paper establishes the theoretical framework, the architectural principles, and the philosophical motivation that would later give rise to I-JEPA (image), V-JEPA (video), A-JEPA (audio), MC-JEPA (multimodal), and numerous other instantiations. The key contribution is both an architecture and an argument: that non-generative, non-contrastive self-supervised learning in latent space is the most promising path toward machines that can learn world models, reason, and plan.
In this article, we first describe the core method and intuition behind JEPA, contrasting it with generative and contrastive paradigms. We then present the complete architecture with detailed diagrams, dissect every component (encoders, predictor, masking, loss), and formalize the training algorithm. We discuss implementation considerations, analyze the energy-based model perspective, examine how JEPA connects to the broader landscape of self-supervised learning, and conclude with JEPA's legacy as the conceptual seed of an entire architectural family.
2. Method
The core idea of JEPA can be stated in one sentence: instead of predicting what you see, predict what it means. More precisely, given one portion of an input (the context), JEPA predicts the latent representation of another portion (the target) — never attempting to reconstruct raw pixels, audio waveforms, or text tokens.
The method proceeds in three conceptual steps:
- Encode the context. A portion of the input $x$ (selected by a masking strategy) is fed through a context encoder $f_\theta$, producing a latent representation $s_x = f_\theta(x_{\text{context}})$. This encoder is trainable via backpropagation.
- Encode the target. The complementary portion of the input (the part that was masked from the context encoder) is fed through a target encoder $f_{\bar{\theta}}$, producing a target representation $s_y = f_{\bar{\theta}}(x_{\text{target}})$. Critically, this target encoder is not trained by gradient descent — its parameters $\bar{\theta}$ are an exponential moving average (EMA) of the context encoder's parameters $\theta$. Gradients do not flow through this path.
- Predict the target representation. A predictor network $g_\phi$ takes the context representation $s_x$ and a specification of what to predict (e.g., which spatial positions were masked), and outputs a predicted representation $\hat{s}_y = g_\phi(s_x, z)$, where $z$ encodes information about the target's location or identity. The training loss is the distance between $\hat{s}_y$ and the (stop-gradient) target $s_y$.
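The three steps can be sketched with toy linear maps standing in for real networks — a minimal numpy illustration, not an implementation; the dimensions and the one-hot location code $z$ are arbitrary choices:

```python
import numpy as np

# Toy sketch of the three conceptual steps. theta is trainable; theta_bar is
# its EMA copy and never receives gradients.
rng = np.random.default_rng(0)
D_in, D, D_pos = 16, 8, 4                       # arbitrary illustrative dims

theta = rng.normal(size=(D_in, D)) * 0.1        # context encoder f_theta
theta_bar = theta.copy()                        # target encoder f_theta_bar (EMA copy)
W_pred = rng.normal(size=(D + D_pos, D)) * 0.1  # predictor g_phi

x_context = rng.normal(size=(1, D_in))          # visible portion of the input
x_target = rng.normal(size=(1, D_in))           # masked portion

s_x = x_context @ theta                         # 1. encode the context
s_y = x_target @ theta_bar                      # 2. encode the target (stop-grad path)
z = np.eye(D_pos)[0][None, :]                   # target-location code (hypothetical)
s_y_hat = np.concatenate([s_x, z], axis=1) @ W_pred  # 3. predict s_y from s_x and z
loss = np.mean((s_y_hat - s_y) ** 2)            # distance measured in latent space
```

In a real instantiation each linear map would be a deep network and $z$ would be positional embeddings, but the information flow — and where gradients do and do not travel — is exactly as shown.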
What makes this fundamentally different from a masked autoencoder (MAE)? In an MAE, the prediction target is the raw input — pixels, patches, tokens. The decoder must reconstruct everything, including the irrelevant stochastic details that carry no semantic meaning. In JEPA, the prediction target is the output of a learned encoder that can (and will) discard unpredictable information. The encoder is incentivized to retain only the information that is predictable from context — the semantic, structural, causal content — while filtering out unpredictable noise. This is an implicit form of information filtering that emerges naturally from the architecture, without requiring any explicit information bottleneck.
From an energy-based model (EBM) perspective, JEPA defines an energy function $E(x, y) = \| s_y - g_\phi(s_x, z) \|^2$ that measures the compatibility between an input context $x$ and a target $y$. Low energy indicates that the target is a plausible completion of the context; high energy indicates implausibility. The critical design question for any EBM is: how do you prevent the model from collapsing to a trivial solution where all energies are low? JEPA addresses this through the asymmetric architecture (EMA target encoder, stop-gradient, predictor bottleneck) rather than through explicit contrastive negatives or reconstruction constraints.
3. Model Overview
JEPA is a modality-agnostic architectural template. The 2022 position paper describes the abstract architecture without committing to a specific input type, encoder backbone, or predictor design — these choices are deferred to concrete instantiations (I-JEPA for images, V-JEPA for video, etc.). Here we describe the architecture at its most general, noting where design decisions must be made for any specific realization.
At-a-Glance
| Property | Value |
|---|---|
| Input type | Generic (modality-agnostic) |
| Masking strategy | Abstract — domain-specific (spatial blocks for images, temporal segments for video, etc.) |
| Encoder architecture | Unspecified (any deep network; position paper suggests transformer-based) |
| Predictor type | Narrow network conditioned on mask/position info $z$ |
| Loss function | $\mathcal{L} = \| \hat{s}_y - \text{sg}(s_y) \|^2$ (L2 in latent space) |
| Key result | Theoretical framework — no benchmark numbers (position paper) |
| Parameters | Architecture-dependent; paper provides framework, not specific model |
| Collapse prevention | Asymmetric architecture: EMA target encoder + predictor bottleneck |
| Supervision | Fully self-supervised (no labels, no negatives, no reconstruction) |
4. Main Components of JEPA
4.1 Context Encoder $f_\theta$
The context encoder is the primary trainable feature extractor in JEPA. It takes as input the unmasked portion of the input (the context) and produces a set of latent representations. Formally:
$$s_x = f_\theta(x_{\text{context}}) \in \mathbb{R}^{N_c \times D}$$where $N_c$ is the number of context tokens/patches/positions and $D$ is the representation dimensionality. The architecture of $f_\theta$ is not prescribed by the JEPA framework — it could be a Vision Transformer (ViT), a convolutional network, a graph neural network, or any architecture appropriate for the input modality. However, LeCun's paper strongly suggests transformer-based architectures due to their flexibility in handling variable-length sequences and their natural compatibility with masked prediction tasks.
Key design principles for the context encoder:
- Receives only visible (unmasked) tokens. Unlike MAE-style approaches where mask tokens are also processed, JEPA's context encoder only processes the context region. This is computationally efficient (fewer tokens) and prevents information leakage.
- Positional encoding is essential. Since the predictor must forecast representations at specific target locations, the context encoder must preserve spatial/temporal position information in its outputs. Learnable or sinusoidal positional embeddings are added to the input.
- Trained end-to-end via backpropagation. Gradients from the prediction loss $\mathcal{L}$ flow through the predictor and back into the context encoder, updating $\theta$.
The context encoder has a dual objective that it learns implicitly: it must produce representations that are (1) informative enough for the predictor to forecast target representations, and (2) abstract enough that the prediction task is feasible — filtering out unpredictable noise and retaining predictable structure.
4.2 Target Encoder $f_{\bar{\theta}}$
The target encoder has the same architecture as the context encoder but serves a fundamentally different role: it produces the prediction targets. It processes the masked (target) region of the input:
$$s_y = f_{\bar{\theta}}(x_{\text{target}}) \in \mathbb{R}^{N_t \times D}$$where $N_t$ is the number of target tokens and $\bar{\theta}$ denotes the target encoder parameters. The critical design decisions:
Exponential Moving Average (EMA) update: The target encoder parameters $\bar{\theta}$ are not trained by gradient descent. Instead, after each optimization step, they are updated as:
$$\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau) \theta$$where $\tau \in [0, 1)$ is the momentum coefficient (typically $\tau \geq 0.996$) and $\theta$ are the current context encoder parameters. This creates a slowly-evolving target that provides stable prediction targets while still tracking the improving encoder.
Stop-gradient: No gradients flow through the target encoder during backpropagation. The target representation $s_y$ is treated as a fixed target for the current optimization step. This is written as $\text{sg}(s_y)$ or $\overline{s_y}$ in the loss function. The combination of EMA and stop-gradient is essential: without them, the system would collapse to the trivial solution where both encoders output a constant, achieving zero prediction loss.
Momentum schedule: In practice (as demonstrated in later instantiations like I-JEPA), the momentum $\tau$ follows a cosine schedule from a starting value $\tau_0$ (e.g., 0.996) to 1.0 over the course of training. Early in training, faster updates ($\tau$ closer to 0.996) allow the target encoder to track the rapidly improving context encoder. Later, slower updates ($\tau$ closer to 1.0) provide more stable targets:
$$\tau_t = 1 - (1 - \tau_0) \cdot \left( \cos\left(\frac{\pi t}{T}\right) + 1 \right) / 2$$where $t$ is the current step and $T$ is the total number of training steps.
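The EMA update and the cosine momentum schedule are simple to state in code — a direct transcription of the two formulas above:

```python
import numpy as np

def momentum_schedule(t, T, tau0=0.996):
    """Cosine momentum schedule: returns tau0 at step t=0 and 1.0 at t=T."""
    return 1.0 - (1.0 - tau0) * (np.cos(np.pi * t / T) + 1.0) / 2.0

def ema_update(theta_bar, theta, tau):
    """Post-step target-encoder update: tau * theta_bar + (1 - tau) * theta."""
    return tau * theta_bar + (1.0 - tau) * theta
```

Note that the schedule is monotonically increasing, so the target encoder moves fastest at the start of training and is effectively frozen by the end.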
4.3 Predictor $g_\phi$
The predictor is the component that makes JEPA fundamentally different from both contrastive and generative methods. Given the context representation $s_x$ and information $z$ about which target positions to predict, the predictor outputs a prediction of the target representation:
$$\hat{s}_y = g_\phi(s_x, z) \in \mathbb{R}^{N_t \times D}$$The predictor must solve a non-trivial mapping problem: from the context representation at positions $\{p_1, \ldots, p_{N_c}\}$, infer what the representation would be at the target positions $\{q_1, \ldots, q_{N_t}\}$. This requires the predictor to learn spatial/temporal/semantic relationships between different parts of the input.
Architecture: LeCun's paper does not prescribe a specific predictor architecture, but the design principle is clear — the predictor should be narrow (lower capacity than the encoders). A narrow predictor creates an information bottleneck that forces the encoders to produce maximally informative representations. If the predictor were as large as the encoder, it could memorize patterns without requiring good encoder representations. Subsequent implementations (I-JEPA, V-JEPA) use narrow transformers with fewer layers and smaller hidden dimensions than the encoder.
Conditioning on target specification $z$: The predictor must know what to predict. The variable $z$ encodes this information — typically as positional embeddings for the target locations. In a vision model, $z$ might be a set of learnable mask tokens at the positions of the masked patches. In a temporal model, $z$ might encode the future time step to predict. The predictor attends to both $s_x$ (the context) and $z$ (the target specification) to produce its output.
Why the predictor matters for collapse prevention: Without a predictor (i.e., if JEPA simply minimized $\|s_x - s_y\|^2$ between context and target representations), the system would be a variant of BYOL/SimSiam and would rely entirely on the EMA asymmetry to prevent collapse. The predictor provides an additional bottleneck: the context encoder is incentivized to produce representations that contain predictive information about the target, while the predictor's limited capacity prevents it from simply copying inputs. This creates a natural information filtering mechanism.
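A minimal sketch of a conditioned predictor, assuming an MLP stand-in (real instantiations such as I-JEPA use a narrow transformer whose mask tokens cross-attend to the context; mean-pooling the context here is a simplification):

```python
import numpy as np

class NarrowPredictor:
    """MLP stand-in for the narrow predictor g_phi. The hidden width is
    deliberately smaller than the representation dim D -- this is the
    capacity bottleneck discussed above."""

    def __init__(self, D, D_pos, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(D + D_pos, hidden)) * 0.1
        self.W2 = rng.normal(size=(hidden, D)) * 0.1

    def __call__(self, s_x, z):
        # s_x: (N_c, D) context representations; z: (N_t, D_pos) position codes.
        ctx = np.broadcast_to(s_x.mean(axis=0), (z.shape[0], s_x.shape[1]))
        h = np.concatenate([ctx, z], axis=1)            # condition on s_x and z
        return np.maximum(h @ self.W1, 0.0) @ self.W2   # (N_t, D) predictions
```

The essential property — that the same context must yield different outputs for different $z$ — is preserved even in this reduced form.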
4.4 Masking Strategy
The masking strategy determines how the input is partitioned into context and target regions. This is the primary mechanism that defines the self-supervised prediction task and has an enormous impact on what representations are learned.
LeCun's position paper discusses masking at an abstract level, identifying key principles:
- The target region should be informative. Masking should remove semantically meaningful content that the model must reason about to predict. Random pixel masking is less effective than masking coherent semantic regions.
- The prediction task should be non-trivial. If the context contains almost all the information needed to trivially reconstruct the target, the model learns nothing useful. The masking ratio should be substantial.
- Multi-block masking is preferred over single-block masking. Predicting multiple separate target blocks from a context forces the model to build a more holistic understanding than predicting a single contiguous region.
The position paper argues strongly against random patch masking (as used in MAE) because adjacent patches provide sufficient local context to reconstruct individual missing patches via interpolation — the model never needs to learn high-level semantics. Block masking, where large contiguous regions are removed, forces the model to predict at a higher level of abstraction because local texture information is insufficient to solve the task.
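A multi-block sampler of the kind described above can be sketched as follows — grid and block sizes are illustrative choices; the 4-block default echoes I-JEPA rather than anything prescribed by the position paper:

```python
import numpy as np

def sample_multiblock_mask(grid=(14, 14), num_blocks=4, block=(4, 4), seed=None):
    """Sample K rectangular target blocks on a patch grid; the context is
    the complement of their union."""
    rng = np.random.default_rng(seed)
    H, W = grid
    bh, bw = block
    target = np.zeros((H, W), dtype=bool)
    for _ in range(num_blocks):                 # blocks may overlap each other
        r = rng.integers(0, H - bh + 1)
        c = rng.integers(0, W - bw + 1)
        target[r:r + bh, c:c + bw] = True
    context = ~target                           # context = everything unmasked
    return context, target
```

Because the blocks are contiguous, no target patch can be recovered by interpolating its immediate neighbors — the property the position paper argues random patch masking lacks.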
4.5 Loss Function
The JEPA loss function measures the discrepancy between predicted and actual target representations in latent space. The basic form is:
$$\mathcal{L}(\theta, \phi) = \frac{1}{N_t} \sum_{i=1}^{N_t} \left\| \hat{s}_y^{(i)} - \text{sg}\left(s_y^{(i)}\right) \right\|_2^2$$where:
- $\hat{s}_y^{(i)} = g_\phi(s_x, z)_i$ is the predicted representation at target position $i$, produced by the predictor
- $s_y^{(i)} = f_{\bar{\theta}}(x_{\text{target}})_i$ is the actual target representation at position $i$, produced by the target encoder
- $\text{sg}(\cdot)$ denotes the stop-gradient operator — gradients do not flow through this term
- $N_t$ is the number of target positions
- $\theta$ are the context encoder parameters, $\phi$ are the predictor parameters
- $\bar{\theta}$ are the target encoder parameters (EMA of $\theta$, not optimized by this loss)
The loss is minimized with respect to $\theta$ and $\phi$ jointly. Gradients flow through $\hat{s}_y$ back to both the predictor $g_\phi$ and the context encoder $f_\theta$. The target representations $s_y$ are treated as fixed targets at each step.
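The loss and its gradient can be written out directly — the stop-gradient shows up as the absence of any gradient term for $s_y$ (a numpy sketch with a manual derivative):

```python
import numpy as np

def jepa_loss(s_y_hat, s_y):
    """Per-position L2 loss in latent space. s_y is treated as a constant
    (stop-gradient), so only the gradient w.r.t. s_y_hat is returned."""
    N_t = s_y_hat.shape[0]
    diff = s_y_hat - s_y                        # s_y: fixed target this step
    loss = np.mean(np.sum(diff ** 2, axis=1))   # (1/N_t) sum_i ||.||_2^2
    grad_s_y_hat = 2.0 * diff / N_t             # no gradient flows to s_y
    return loss, grad_s_y_hat
```

In an autograd framework the same effect is obtained by detaching $s_y$ from the computation graph before the subtraction.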
Why L2 loss works here (but wouldn't for pixel prediction): In pixel space, L2 loss produces blurry predictions because it averages over all possible reconstructions. In latent space, this problem is dramatically reduced: the encoder has already discarded the unpredictable, stochastic details. What remains in the latent space is precisely the deterministic, predictable content — and for this content, L2 is a reasonable distance measure. The encoder is implicitly trained to retain only what is consistently predictable, which is exactly the regime in which L2 is an appropriate loss.
Why this loss prevents collapse (analysis): The most dangerous failure mode for JEPA is representational collapse — where the encoders map all inputs to the same constant vector, trivially achieving zero loss. JEPA prevents this through three complementary mechanisms:
- EMA target encoder + stop-gradient: The target encoder evolves slowly and doesn't receive gradients. The context encoder cannot "negotiate" with the target encoder to jointly collapse — it must adapt to whatever the target encoder produces. Since the target encoder started from a random initialization and evolves slowly, its outputs remain diverse for long enough that the context encoder learns useful features.
- Predictor bottleneck: The predictor is deliberately narrow. If the context encoder produced constant representations, the narrow predictor could not map them to the diverse target representations that the EMA-lagged target encoder still produces. The predictor's limited capacity creates a pressure for the context encoder to produce informative inputs.
- Multi-target prediction: Predicting representations at multiple target positions simultaneously is harder to collapse than predicting a single target. The model must produce position-dependent predictions, which requires non-trivial encoding of the context.
4.6 Energy-Based Model Perspective
LeCun frames JEPA within the broader context of Energy-Based Models (EBMs). An EBM defines a scalar energy function $E(x, y)$ over input pairs, where low energy indicates compatibility and high energy indicates incompatibility. JEPA's energy function is:
$$E_w(x, y) = \left\| f_{\bar{\theta}}(y) - g_\phi\left(f_\theta(x), z\right) \right\|^2$$where $w = (\theta, \phi, \bar{\theta})$ are all the model parameters. This energy is low when the predicted representation of $y$ from context $x$ matches the actual representation of $y$.
The fundamental challenge in EBM training is shaping the energy landscape so that compatible pairs have low energy while incompatible pairs have high energy. There are four principal strategies:
- Contrastive methods: Explicitly push up the energy of negative (incompatible) pairs while pulling down the energy of positive pairs. Examples: SimCLR, MoCo, InfoNCE.
- Regularization methods: Add regularization terms that prevent the energy surface from becoming flat (collapsed). Examples: VICReg, Barlow Twins.
- Architectural methods: Design the architecture so that collapse is structurally discouraged. Examples: JEPA (asymmetric EMA + predictor bottleneck), BYOL.
- Generative/reconstruction methods: Train the model to minimize energy only for observed data, relying on the reconstruction objective to implicitly raise energy for unobserved data. Examples: VAE, MAE.
JEPA relies primarily on the architectural strategy (option 3), optionally supplemented by regularization (option 2). LeCun argues in the paper that this is preferable to contrastive methods because (a) constructing good negative samples is difficult, especially in structured prediction tasks, and (b) contrastive methods have an implicit mode-covering behavior that can lead to overly broad, low-information representations.
4.7 Information Filtering and the Role of Latent Prediction
A distinctive and theoretically important property of JEPA is its implicit information filtering. This emerges naturally from the architecture without any explicit information bottleneck, dropout, or capacity constraint on the encoder (though the predictor bottleneck helps).
Consider what happens during training: the encoder $f_\theta$ learns to map inputs to a representation space where the prediction loss is minimized. Any information in the input that is not predictable from context — random texture variations, sensor noise, precise lighting conditions, exact pixel values — will not help reduce the prediction loss. Including such information in the representation would actually increase the prediction loss, because the predictor would be penalized for failing to predict these unpredictable details. Therefore, the encoder is implicitly trained to discard unpredictable information and retain only the predictable, semantic content.
In information-theoretic terms, because $s_y$ is a deterministic function of $x_{\text{target}}$, the representation can carry no more information than the input itself: $$I(s_y; x_{\text{target}}) \leq I(x_{\text{target}}; x_{\text{target}}) = H(x_{\text{target}})$$where $I(\cdot;\cdot)$ denotes mutual information and $H(\cdot)$ denotes entropy. The encoder compresses the input, and the compression is guided by what is predictable from context. Formally, the encoder learns representations that maximize:
$$I(s_y; s_x) \quad \text{subject to} \quad s_y = f_{\bar\theta}(x_\text{target})$$This is the information-theoretic dual of the observation that generative models waste capacity on unpredictable details: JEPA automatically learns to ignore them.
5. Implementation Details
The JEPA position paper by LeCun (2022) is a theoretical framework paper, not an empirical methods paper. It provides no specific hyperparameters, benchmark numbers, or implementation, and no public repository exists. Concrete instantiations of JEPA with full implementation details came later: I-JEPA (Assran et al., 2023) for images and V-JEPA (Bardes et al., 2024) for video.
However, the position paper does establish architectural principles that constrain any valid implementation. We present these as a reference table, noting which values come from the paper's discussion versus later instantiations.
| Hyperparameter | Value / Recommendation | Source & Notes |
|---|---|---|
| Context Encoder | Deep network (transformer recommended) | Position paper; specific architecture deferred to instantiations |
| Target Encoder | Same architecture as context encoder | Position paper; EMA-updated copy |
| Predictor | Narrow network (fewer layers, smaller hidden dim) | Position paper principle; I-JEPA uses 12-layer narrow transformer |
| Predictor capacity ratio | ~1/4 to 1/2 of encoder capacity | Implicit from position paper discussion; I-JEPA uses 384-dim predictor vs 1024-dim encoder for ViT-L |
| EMA momentum $\tau$ | $\geq 0.996$, schedule to $1.0$ | Not specified in position paper; I-JEPA uses cosine schedule $0.996 \to 1.0$ |
| Masking strategy | Block masking (multi-block preferred) | Position paper; I-JEPA uses 4 target blocks, 85% mask ratio |
| Loss function | L2 (MSE) in latent space | Position paper; later variants explore smooth-L1, cosine similarity |
| Optimizer | Not specified | I-JEPA uses AdamW with $\beta_1=0.9$, $\beta_2=0.95$ |
| Learning rate | Not specified | I-JEPA uses $1.5 \times 10^{-4}$ with cosine decay |
| Batch size | Not specified | I-JEPA uses 2048 |
| Warmup | Not specified | I-JEPA uses 15 epochs |
| Training epochs | Not specified | I-JEPA uses 600 epochs on ImageNet-1K |
| Data augmentation | Minimal to none (by design) | Position paper; JEPA should not require hand-crafted augmentations |
| Normalization | Layer normalization on targets | Common in implementations; prevents loss scale issues |
| Weight decay | Not specified | I-JEPA uses 0.05 |
| Mixed precision | Not specified | I-JEPA uses bf16 for efficiency |
| GPU requirements | Not specified | I-JEPA: 16 A100 GPUs, ~72h for ViT-H/14 on IN-1K |
6. Algorithm
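The position paper gives no pseudocode, so the following is a hedged sketch of one training iteration using toy linear encoders and a pooled-context predictor — the function name and all dimensions are illustrative, and Section 7 describes the same steps in prose:

```python
import numpy as np

def jepa_train_step(theta, theta_bar, W_pred, tokens, ctx_idx, tgt_idx, z,
                    tau=0.996, lr=1e-3):
    """One JEPA iteration with toy linear encoders/predictor: encode context,
    encode target without gradients, predict, take a gradient step on theta
    and W_pred, then EMA-update theta_bar in place."""
    D = theta.shape[1]
    s_x = tokens[ctx_idx] @ theta               # context encoding (trainable)
    s_y = tokens[tgt_idx] @ theta_bar           # target encoding (no gradient)
    pooled = np.broadcast_to(s_x.mean(0), (len(tgt_idx), D))
    inp = np.concatenate([pooled, z], axis=1)   # condition on context and z
    s_y_hat = inp @ W_pred                      # predictor output
    diff = s_y_hat - s_y
    loss = np.mean(np.sum(diff ** 2, axis=1))

    # Manual gradients; the stop-gradient means no term involves theta_bar.
    g_out = 2.0 * diff / len(tgt_idx)           # dL/d s_y_hat
    g_inp = g_out @ W_pred.T
    g_pool = g_inp[:, :D].sum(axis=0)           # back through the broadcast
    g_s_x = np.broadcast_to(g_pool / len(ctx_idx), s_x.shape)
    W_pred -= lr * (inp.T @ g_out)              # update predictor
    theta -= lr * (tokens[ctx_idx].T @ g_s_x)   # update context encoder
    theta_bar *= tau                            # EMA update of target encoder
    theta_bar += (1.0 - tau) * theta
    return loss
```

The structure — forward on two paths, backward on one, EMA on the other — is the invariant that any faithful implementation must preserve, whatever the encoder architecture.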
7. Training
A single JEPA training iteration proceeds through the following steps:
- Sample a batch of $B$ input samples $\{x_1, \ldots, x_B\}$ from the dataset. No labels are used.
- Generate masks. For each sample, use the block masking strategy (Algorithm 2) to define $K$ target regions $\{\mathcal{T}_1, \ldots, \mathcal{T}_K\}$ and the complementary context region $\mathcal{C}$. The masks are sampled stochastically — different samples in the batch have different masks, and different training steps use different masks for the same sample.
- Tokenize/patchify the input. The raw input $x$ is converted to a sequence of tokens (e.g., for images, non-overlapping patches are flattened and linearly projected; for audio, spectral frames are embedded). Positional embeddings are added.
- Separate context and target tokens. Using the mask, partition the tokens into context tokens (at positions in $\mathcal{C}$) and target tokens (at positions in $\mathcal{T}_1 \cup \cdots \cup \mathcal{T}_K$).
- Context encoding. Feed the context tokens through the context encoder $f_\theta$ to obtain context representations $s_x \in \mathbb{R}^{N_c \times D}$. This is the primary trainable path.
- Target encoding. Feed the target tokens through the target encoder $f_{\bar{\theta}}$ to obtain target representations $s_y \in \mathbb{R}^{N_t \times D}$. Apply stop-gradient: no gradient is computed for this path.
- Prediction. The predictor $g_\phi$ receives the context representations $s_x$ and positional specifications $z_k$ for each target block, and produces predicted target representations $\hat{s}_{y_k}$ for each block.
- Loss computation. Compute $\mathcal{L} = \frac{1}{K}\sum_k \frac{1}{N_{t_k}} \|\hat{s}_{y_k} - \text{sg}(s_{y_k})\|^2$.
- Backpropagation. Compute gradients $\nabla_\theta \mathcal{L}$ and $\nabla_\phi \mathcal{L}$ and update the context encoder and predictor parameters.
- EMA update. Update the target encoder: $\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau)\theta$.
Mathematical formulation of the training objective. The full JEPA training objective, aggregated over a dataset $\mathcal{D}$ and stochastic masks, is:
$$\min_{\theta, \phi} \; \mathbb{E}_{x \sim \mathcal{D}} \; \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})} \left[ \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_{t_k}} \sum_{i=1}^{N_{t_k}} \left\| g_\phi\bigl(f_\theta(x_\mathcal{C}), z_k\bigr)_i - \text{sg}\bigl(f_{\bar{\theta}}(x_{\mathcal{T}_k})\bigr)_i \right\|_2^2 \right]$$where $\mathcal{M}$ denotes the stochastic mask (which determines $\mathcal{C}$ and $\{\mathcal{T}_k\}$), $p(\mathcal{M})$ is the masking distribution (Algorithm 2), and $\bar{\theta}$ is updated via EMA after each optimization step (not optimized by this loss).
8. Inference
After pretraining, JEPA produces a feature extractor that can be deployed for downstream tasks. The inference procedure is substantially simpler than training.
Which encoder is kept? The target encoder $f_{\bar{\theta}}$ is typically used for downstream tasks, as it represents a smoothed version of the context encoder that has been shown empirically to produce slightly better features (analogous to BYOL/MoCo). Some implementations evaluate both and select based on downstream performance, but the target encoder is the default choice.
The predictor is discarded. The predictor $g_\phi$ served its purpose during training — it created the learning signal that drove the encoder to produce good representations. At inference time, only the encoder is needed.
Feature extraction: Given a new input $x$, the full (unmasked) input is fed through the kept encoder to produce a representation:
$$s = f_{\bar{\theta}}(x) \in \mathbb{R}^{N \times D}$$where $N$ is the number of tokens for the full input and $D$ is the representation dimension. Depending on the downstream task, features can be aggregated:
- Global average pooling: $\bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_i \in \mathbb{R}^D$ — for classification
- CLS token: $s_{\text{CLS}} \in \mathbb{R}^D$ — if the encoder uses a CLS token
- Token-level features: $s \in \mathbb{R}^{N \times D}$ — for dense prediction tasks (segmentation, detection)
- Multi-layer features: Concatenation of intermediate layer outputs — for tasks requiring multi-scale features
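The first and third aggregation modes can be sketched directly; the CLS and multi-layer options depend on encoder specifics and are omitted here:

```python
import numpy as np

def aggregate_features(s, mode="mean"):
    """Aggregate token-level features s of shape (N, D) for downstream use."""
    if mode == "mean":
        return s.mean(axis=0)   # global average pooling -> (D,) for classification
    if mode == "tokens":
        return s                # dense (N, D) map for segmentation/detection
    raise ValueError(f"unknown mode: {mode}")
```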
Evaluation protocols:
- Linear probing: Train a single linear layer $W \in \mathbb{R}^{C \times D}$ on top of frozen encoder features. This evaluates the linear separability of the learned representations and is the standard benchmark for self-supervised methods.
- Attentive probing: A lightweight cross-attention module that attends to the full token sequence $s \in \mathbb{R}^{N \times D}$. This evaluates whether fine-grained spatial information is preserved in the representations.
- Fine-tuning: Unfreezing all encoder parameters and training end-to-end on the downstream task with a task-specific head. This evaluates the encoder's utility as an initialization.
- $k$-NN evaluation: Nearest-neighbor classification in the representation space without any training. This evaluates the raw geometric structure of the latent space.
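Of these protocols, $k$-NN evaluation is the simplest to sketch, since it needs no training: classify each query by majority vote among its nearest frozen-encoder features. Cosine similarity is a common choice, though details vary across implementations:

```python
import numpy as np

def knn_classify(train_feats, train_labels, query_feats, k=5):
    """k-NN classification in the frozen representation space."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = b @ a.T                            # (Q, N) cosine similarities
    nn = np.argsort(-sims, axis=1)[:, :k]     # k nearest neighbors per query
    votes = train_labels[nn]                  # (Q, k) neighbor labels
    return np.array([np.bincount(v).argmax() for v in votes])
```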
9. Results & Benchmarks
The JEPA position paper (LeCun, 2022) is a theoretical contribution — it does not contain empirical benchmarks or experimental results. The paper's contribution is the architecture and the argument for why latent prediction should outperform pixel prediction and contrastive learning. Empirical validation came through subsequent instantiations.
However, to ground the JEPA framework in concrete numbers, we present results from I-JEPA (Assran et al., CVPR 2023), the first direct realization of the JEPA principles for images, which closely follows the framework described in the position paper:
9.1 ImageNet-1K Classification (I-JEPA Results)
| Method | Architecture | Pretraining Data | Linear Probe (Top-1 %) | Approach |
|---|---|---|---|---|
| I-JEPA | ViT-H/14 (632M) | IN-1K | 77.3 | JEPA (latent prediction) |
| MAE | ViT-H/14 (632M) | IN-1K | 76.0 | Generative (pixel prediction) |
| data2vec v2 | ViT-H/14 (632M) | IN-1K | 76.3 | Latent prediction + multi-mask |
| iBOT | ViT-L/16 (307M) | IN-1K | 75.4 | Contrastive + MIM |
| DINO | ViT-L/16 (307M) | IN-1K | 76.1 | Contrastive (self-distillation) |
| MoCo v3 | ViT-L/16 (307M) | IN-1K | 73.4 | Contrastive (momentum) |
9.2 Low-Shot and Transfer (I-JEPA Results)
| Method | 1% Labels (Top-1) | CIFAR-100 (Linear) | Places205 (Linear) |
|---|---|---|---|
| I-JEPA (ViT-H/14) | 70.5 | 84.5 | 59.2 |
| MAE (ViT-H/14) | 64.5 | 80.5 | 57.9 |
| DINO (ViT-L/16) | 69.8 | 83.1 | 58.6 |
9.3 Computational Efficiency (I-JEPA Results)
A critical practical advantage predicted by LeCun's position paper is that JEPA should be more computationally efficient than methods that require processing of all tokens. I-JEPA confirmed this:
| Method | Architecture | GPU Hours (16×A100) | IN-1K Linear (%) | Throughput Ratio |
|---|---|---|---|---|
| I-JEPA | ViT-H/14 | ~1150 | 77.3 | 1.0× (reference) |
| MAE | ViT-H/14 | ~1600 | 76.0 | 0.72× |
| iBOT | ViT-L/16 | ~3800 | 75.4 | 0.30× |
| DINO v2 | ViT-L/16 | ~4000+ | 76.1 | 0.29× |
The efficiency gain comes from two sources: (1) the context encoder only processes $\sim$15% of the tokens (the unmasked context), and (2) no data augmentation is required, eliminating the cost of multi-crop augmentation used by contrastive methods.
9.4 Key Ablations from I-JEPA
The following ablation studies from I-JEPA (Assran et al., 2023) validate specific architectural decisions from the JEPA framework:
| Ablation | Modification | IN-1K Linear (%) | Δ vs Default |
|---|---|---|---|
| Default I-JEPA | — | 73.4 (ViT-L) | — |
| Random masking | Random patches instead of blocks | 69.1 | −4.3 |
| Single target block | $K=1$ instead of $K=4$ | 71.8 | −1.6 |
| Pixel reconstruction loss | Predict pixels instead of latent | 68.2 | −5.2 |
| No predictor (direct match) | L2 between encoded context and target | collapse | — (total failure) |
| Wide predictor | Same dim as encoder | 71.0 | −2.4 |
| No EMA (both trained) | Both encoders receive gradients | collapse | — (total failure) |
These ablations validate every major architectural decision proposed in LeCun's position paper: block masking outperforms random masking, latent prediction outperforms pixel prediction, the predictor is necessary and should be narrow, and the EMA target encoder is essential for preventing collapse.
10. Connection to the JEPA Family
The JEPA position paper is the origin point of the entire JEPA family of architectures. It occupies a unique position: it is not an empirical methods paper with specific implementation details, but rather the theoretical and philosophical foundation from which all subsequent JEPA variants derive.
The JEPA Lineage
The ideas in the position paper were instantiated and extended by a sequence of later works: I-JEPA for images, V-JEPA for video, A-JEPA for audio, and MC-JEPA for multimodal learning, among others.
What later variants borrowed from the position paper:
- The asymmetric encoder-predictor-target architecture: Every JEPA variant uses this fundamental three-component structure with EMA target encoder and stop-gradient.
- Latent space prediction: No JEPA variant predicts raw input data. The entire family operates in learned latent spaces, as prescribed by the position paper.
- Block masking over random masking: The position paper's argument against random masking influenced all subsequent designs. I-JEPA uses multi-block, V-JEPA uses spatiotemporal tubes, A-JEPA uses frequency-temporal blocks.
- Minimal data augmentation: The position paper argued that JEPA should not rely on hand-crafted augmentations. I-JEPA achieved strong results with no augmentation beyond basic resizing — a stark contrast to contrastive methods that require multi-crop, color jitter, and other augmentations.
What is genuinely novel in the position paper:
The position paper's contribution is not a single technical trick but a coherent architectural philosophy grounded in energy-based model theory and cognitive science intuitions. Specifically:
- The argument that prediction should happen in latent space, not input space. While prior methods (BYOL, SimSiam) operated in latent space, they used global invariance objectives, not structured prediction. JEPA is the first framework to articulate why latent prediction is superior: it enables implicit information filtering, avoiding the waste of modeling capacity on unpredictable details.
- The predictor as a core architectural component for collapse prevention and structured reasoning. Prior self-distillation methods treated the projection head as an afterthought. JEPA positions the predictor as a first-class component that (a) captures the functional relationship between input regions, (b) provides an information bottleneck for collapse prevention, and (c) enables the model to learn about the structure of the world (spatial, temporal, causal relationships).
- The vision of JEPA as a world model for planning. The paper goes beyond representation learning to propose that JEPA-style predictive models, when conditioned on actions, can serve as world models for hierarchical planning — a vision that connects self-supervised learning to the broader goal of autonomous machine intelligence.
The world model vision: Perhaps the most ambitious aspect of the position paper is its proposal that JEPA is not merely a pretraining method but a world model architecture. LeCun envisions a hierarchical JEPA where:
- A low-level JEPA predicts short-term sensory representations from recent context.
- A mid-level JEPA predicts abstract state representations at longer time scales.
- A high-level JEPA predicts goal-relevant features over extended horizons.
- An action-conditioned variant predicts the consequences of actions, enabling planning: $\hat{s}_{t+1} = g_\phi(s_t, a_t)$ where $a_t$ is the action taken at time $t$.
This hierarchical world model would enable an agent to plan by simulating the consequences of action sequences in latent space — without ever predicting or rendering actual sensory observations. This vision has begun to be realized in works on action-conditioned video prediction and model-based reinforcement learning using JEPA-style architectures.
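The planning loop implied by $\hat{s}_{t+1} = g_\phi(s_t, a_t)$ can be sketched with a simple random-shooting planner. Everything below is a toy illustration: the linear dynamics stand in for a learned JEPA predictor, and the goal is given directly as a latent vector rather than produced by a higher-level JEPA:

```python
import numpy as np

# Toy sketch of planning in latent space with an action-conditioned predictor.
# A real system would use a learned JEPA predictor g_phi; the linear dynamics
# and quadratic cost here are stand-ins for illustration only.
rng = np.random.default_rng(0)

def g(s, a):
    """Predicted next latent state: s_{t+1} = g(s_t, a_t) (toy dynamics)."""
    return 0.9 * s + a

def plan(s0, goal, horizon=5, n_candidates=256):
    """Random-shooting planner: simulate candidate action sequences entirely
    in latent space and return the first action of the sequence whose final
    state lands nearest the goal representation. No sensory observation is
    ever predicted or rendered."""
    best_cost, best_first_action = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, s0.shape[0]))
        s = s0
        for a in actions:
            s = g(s, a)                     # roll out in latent space
        cost = np.linalg.norm(s - goal)     # distance to goal representation
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action, best_cost

s0, goal = np.zeros(4), np.ones(4)
action, cost = plan(s0, goal)
print(cost)  # best candidate's final-state distance to the goal
```

More capable planners (CEM, gradient-based trajectory optimization) replace the random search, but the structure is the same: the cost of an action sequence is evaluated by rolling the predictor forward in representation space.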
11. Comparison with Contrastive and Generative Approaches
The position paper devotes substantial discussion to contrasting JEPA with its two main alternatives: contrastive learning and generative (reconstructive) learning. Understanding these distinctions is essential for understanding JEPA's place in the landscape.
JEPA vs. Generative (MAE, diffusion models): Generative methods predict the raw input — pixels, waveforms, tokens. This is problematic for two reasons. First, raw input space contains massive amounts of unpredictable information (exact textures, noise, stochastic details). The model must spend capacity modeling this irrelevant information, because the loss penalizes any discrepancy. Second, when uncertainty is high (the model is unsure which of several plausible completions is correct), L2 loss in pixel space produces the average of all possibilities — a blurry, unrealistic result. JEPA avoids both problems: the encoder filters unpredictable information before prediction, and L2 loss in the filtered latent space corresponds to predicting the common semantic content shared by all plausible completions.
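The mode-averaging pathology of L2 loss can be verified with a two-line computation. The "completions" below are toy arrays standing in for two equally plausible pixel-space reconstructions:

```python
import numpy as np

# Toy demonstration: under L2 loss, the optimal single prediction for a
# multimodal target is the MEAN of the modes. In pixel space that mean is a
# blurry image matching neither plausible completion.
completion_a = np.array([1.0, 0.0, 1.0, 0.0])   # one plausible completion
completion_b = np.array([0.0, 1.0, 0.0, 1.0])   # another, equally likely

# Expected L2 loss of a prediction p is 0.5*||p - a||^2 + 0.5*||p - b||^2,
# which is minimized at the average of the two modes:
l2_optimal = 0.5 * (completion_a + completion_b)
print(l2_optimal)   # [0.5 0.5 0.5 0.5] -- unlike either mode

# If an encoder first maps both completions to the same latent code
# (filtering out the unpredictable which-mode detail), the latent-space L2
# target is a single sharp point and no averaging artifact arises.
```

This is precisely why JEPA's L2 objective is benign in latent space while the same objective is harmful in pixel space: the encoder removes the multimodality before the loss ever sees it.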
JEPA vs. Contrastive (SimCLR, MoCo, DINO): Contrastive methods learn an energy function that distinguishes compatible pairs from incompatible ones. They require either explicit negative samples (SimCLR, MoCo) or architectural tricks (BYOL, SimSiam) to prevent collapse. The key limitation is that contrastive methods learn an invariance — the embedding of one augmented view should match the embedding of another augmented view of the same input. This is a much simpler task than JEPA's prediction task: contrastive methods learn that "these two views are compatible" while JEPA learns "given this context, the missing content should have this representation." JEPA's prediction captures richer structural information about the relationships between parts of the input.
JEPA vs. BYOL/SimSiam (non-contrastive joint-embedding): BYOL and SimSiam also use asymmetric architectures without negatives, but they operate on global representations — the entire input is encoded into a single vector, and the objective is invariance between augmented views. JEPA operates on structured, position-dependent representations and performs spatially-structured prediction. This is a strictly harder task that yields representations with richer spatial and relational structure.
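The structural difference can be made concrete with a small sketch. All shapes and the toy "predictor" below are illustrative, not any paper's actual implementation; the point is the shape of the objective, not its contents:

```python
import numpy as np

# Structural sketch of a JEPA-style objective. BYOL/SimSiam match ONE global
# vector per view; JEPA instead predicts a separate latent for EACH masked
# position from the visible context. (Toy predictor; shapes illustrative.)
rng = np.random.default_rng(0)
num_tokens, dim = 16, 8

target_feats = rng.normal(size=(num_tokens, dim))  # stand-in target-encoder output
mask = np.zeros(num_tokens, dtype=bool)
mask[4:8] = True                                   # one contiguous masked block

# Toy predictor: each masked position gets its own prediction, conditioned on
# a summary of the visible context plus a position-dependent term.
context = target_feats[~mask].mean(axis=0)         # summary of visible tokens
pos_embed = rng.normal(size=(mask.sum(), dim))     # per-position conditioning
predictions = context + 0.1 * pos_embed            # shape (4, dim): one per position

# The loss is L2 in latent space, ONLY at masked positions, against the
# stop-gradient target features: position-wise prediction, not a single
# global invariance constraint between two augmented views.
loss = np.mean((predictions - target_feats[mask]) ** 2)
print(predictions.shape, round(float(loss), 3))
```

A BYOL-style objective would collapse this to one comparison between two pooled global vectors; the per-position targets are what force JEPA representations to retain spatial and relational structure.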
12. Summary
Main Contribution: A coherent theoretical framework — grounded in energy-based model theory — that unifies the encoder-predictor-target architecture, block masking, EMA target encoders, and predictor bottlenecks into a single, modality-agnostic blueprint for self-supervised learning. The framework's validity was subsequently demonstrated by I-JEPA, V-JEPA, A-JEPA, and numerous other instantiations that matched or exceeded the performance of both generative and contrastive methods across images, video, audio, and multimodal domains.
When to use JEPA vs. alternatives:
- Use JEPA when you want efficient pretraining without hand-crafted augmentations, when you need representations that capture semantic structure (not pixel-level detail), when working in domains where constructing good negatives is difficult, or when aiming to build world models that predict in latent space.
- Consider contrastive methods when you have well-understood augmentations for your domain and want the simplest possible training setup with minimal collapse risk.
- Consider generative methods when you need the model to also generate/reconstruct data (e.g., for image synthesis), or when you operate in a domain where all input details are relevant (e.g., lossless compression).
13. References
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview preprint, Version 0.9.2.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ECCV 2024. (V-JEPA)
- Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z. D., Azar, M. G., et al. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. (BYOL)
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022. (MAE)
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020. (SimCLR)
- He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020. (MoCo)
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021. (DINO)
- Chen, X. & He, K. (2021). Exploring Simple Siamese Representation Learning. CVPR 2021. (SimSiam)
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. (ViT)
- Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. ICML 2022.
- Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., & Kong, T. (2022). iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR 2022.
- Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021.
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.