Joint-Embedding Predictive Architecture (JEPA)
1. Introduction
How do humans learn so efficiently about the world? An infant, with remarkably little explicit supervision, builds a rich internal model of physics, object permanence, and cause-and-effect — all within the first few months of life. Modern AI systems, by contrast, require billions of labeled examples, meticulously curated reward functions, or massive generative pretraining to achieve a fraction of this competence. The Joint-Embedding Predictive Architecture (JEPA), proposed by Yann LeCun in his 2022 position paper A Path Towards Autonomous Machine Intelligence, offers a foundational blueprint for closing this gap — not by predicting raw sensory inputs, but by predicting abstract representations of the world in a learned latent space.
The central problem JEPA addresses is deceptively simple: how should a machine learn the structure of the world from observation alone, without requiring reconstruction of every irrelevant detail? Generative models (autoencoders, masked autoencoders, diffusion models) attempt to predict input data in pixel space or token space. This forces the model to waste capacity on modeling every texture gradient, noise pattern, and stochastic detail — information that is often irrelevant for understanding. Contrastive methods (SimCLR, MoCo, CLIP) sidestep pixel prediction but introduce their own pathologies: they require careful negative sampling, are susceptible to dimensional collapse, and their energy landscape only distinguishes "compatible" from "incompatible" pairs without modeling the relationships between compatible inputs. JEPA proposes a third path: predict the representation of one part of the input from the representation of another part, entirely within a learned latent space.
As the foundational architecture of the JEPA family, this work does not derive from any predecessor — it is the predecessor. LeCun's paper establishes the theoretical framework, the architectural principles, and the philosophical motivation that would later give rise to I-JEPA (image), V-JEPA (video), A-JEPA (audio), MC-JEPA (multimodal), and numerous other instantiations. The key contribution is both an architecture and an argument: that non-generative, non-contrastive self-supervised learning in latent space is the most promising path toward machines that can learn world models, reason, and plan.
In this article, we first describe the core method and intuition behind JEPA, contrasting it with generative and contrastive paradigms. We then present the complete architecture with detailed diagrams, dissect every component (encoders, predictor, masking, loss), and formalize the training algorithm. We discuss implementation considerations, analyze the energy-based model perspective, examine how JEPA connects to the broader landscape of self-supervised learning, and conclude with JEPA's legacy as the conceptual seed of an entire architectural family.
2. Method
The core idea of JEPA can be stated in one sentence: instead of predicting what you see, predict what it means. More precisely, given one portion of an input (the context), JEPA predicts the latent representation of another portion (the target) — never attempting to reconstruct raw pixels, audio waveforms, or text tokens.
The method proceeds in three conceptual steps:
- Encode the context. A portion of the input $x$ (selected by a masking strategy) is fed through a context encoder $f_\theta$, producing a latent representation $s_x = f_\theta(x_{\text{context}})$. This encoder is trainable via backpropagation.
- Encode the target. The complementary portion of the input (the part that was masked from the context encoder) is fed through a target encoder $f_{\bar{\theta}}$, producing a target representation $s_y = f_{\bar{\theta}}(x_{\text{target}})$. Critically, this target encoder is not trained by gradient descent — its parameters $\bar{\theta}$ are an exponential moving average (EMA) of the context encoder's parameters $\theta$. Gradients do not flow through this path.
- Predict the target representation. A predictor network $g_\phi$ takes the context representation $s_x$ and a specification of what to predict (e.g., which spatial positions were masked), and outputs a predicted representation $\hat{s}_y = g_\phi(s_x, z)$, where $z$ encodes information about the target's location or identity. The training loss is the distance between $\hat{s}_y$ and the (stop-gradient) target $s_y$.
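The three steps can be sketched with toy linear maps standing in for real networks — a minimal numpy illustration, not an implementation; the dimensions and the one-hot location code $z$ are arbitrary choices:

```python
import numpy as np

# Toy sketch of the three conceptual steps. theta is trainable; theta_bar is
# its EMA copy and never receives gradients.
rng = np.random.default_rng(0)
D_in, D, D_pos = 16, 8, 4                       # arbitrary illustrative dims

theta = rng.normal(size=(D_in, D)) * 0.1        # context encoder f_theta
theta_bar = theta.copy()                        # target encoder f_theta_bar (EMA copy)
W_pred = rng.normal(size=(D + D_pos, D)) * 0.1  # predictor g_phi

x_context = rng.normal(size=(1, D_in))          # visible portion of the input
x_target = rng.normal(size=(1, D_in))           # masked portion

s_x = x_context @ theta                         # 1. encode the context
s_y = x_target @ theta_bar                      # 2. encode the target (stop-grad path)
z = np.eye(D_pos)[0][None, :]                   # target-location code (hypothetical)
s_y_hat = np.concatenate([s_x, z], axis=1) @ W_pred  # 3. predict s_y from s_x and z
loss = np.mean((s_y_hat - s_y) ** 2)            # distance measured in latent space
```

In a real instantiation each linear map would be a deep network and $z$ would be positional embeddings, but the information flow — and where gradients do and do not travel — is exactly as shown.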
What makes this fundamentally different from a masked autoencoder (MAE)? In an MAE, the prediction target is the raw input — pixels, patches, tokens. The decoder must reconstruct everything, including the irrelevant stochastic details that carry no semantic meaning. In JEPA, the prediction target is the output of a learned encoder that can (and will) discard unpredictable information. The encoder is incentivized to retain only the information that is predictable from context — the semantic, structural, causal content — while filtering out unpredictable noise. This is an implicit form of information filtering that emerges naturally from the architecture, without requiring any explicit information bottleneck.
From an energy-based model (EBM) perspective, JEPA defines an energy function $E(x, y) = \| s_y - g_\phi(s_x, z) \|^2$ that measures the compatibility between an input context $x$ and a target $y$. Low energy indicates that the target is a plausible completion of the context; high energy indicates implausibility. The critical design question for any EBM is: how do you prevent the model from collapsing to a trivial solution where all energies are low? JEPA addresses this through the asymmetric architecture (EMA target encoder, stop-gradient, predictor bottleneck) rather than through explicit contrastive negatives or reconstruction constraints.
3. Model Overview
JEPA is a modality-agnostic architectural template. The 2022 position paper describes the abstract architecture without committing to a specific input type, encoder backbone, or predictor design — these choices are deferred to concrete instantiations (I-JEPA for images, V-JEPA for video, etc.). Here we describe the architecture at its most general, noting where design decisions must be made for any specific realization.
At-a-Glance
| Property | Value |
|---|---|
| Input type | Generic (modality-agnostic) |
| Masking strategy | Abstract — domain-specific (spatial blocks for images, temporal segments for video, etc.) |
| Encoder architecture | Unspecified (any deep network; position paper suggests transformer-based) |
| Predictor type | Narrow network conditioned on mask/position info $z$ |
| Loss function | $\mathcal{L} = \| \hat{s}_y - \text{sg}(s_y) \|^2$ (L2 in latent space) |
| Key result | Theoretical framework — no benchmark numbers (position paper) |
| Parameters | Architecture-dependent; paper provides framework, not specific model |
| Collapse prevention | Asymmetric architecture: EMA target encoder + predictor bottleneck |
| Supervision | Fully self-supervised (no labels, no negatives, no reconstruction) |
4. Main Components of JEPA
4.1 Context Encoder $f_\theta$
The context encoder is the primary trainable feature extractor in JEPA. It takes as input the unmasked portion of the input (the context) and produces a set of latent representations. Formally:
$$s_x = f_\theta(x_{\text{context}}) \in \mathbb{R}^{N_c \times D}$$where $N_c$ is the number of context tokens/patches/positions and $D$ is the representation dimensionality. The architecture of $f_\theta$ is not prescribed by the JEPA framework — it could be a Vision Transformer (ViT), a convolutional network, a graph neural network, or any architecture appropriate for the input modality. However, LeCun's paper strongly suggests transformer-based architectures due to their flexibility in handling variable-length sequences and their natural compatibility with masked prediction tasks.
Key design principles for the context encoder:
- Receives only visible (unmasked) tokens. Unlike MAE-style approaches where mask tokens are also processed, JEPA's context encoder only processes the context region. This is computationally efficient (fewer tokens) and prevents information leakage.
- Positional encoding is essential. Since the predictor must forecast representations at specific target locations, the context encoder must preserve spatial/temporal position information in its outputs. Learnable or sinusoidal positional embeddings are added to the input.
- Trained end-to-end via backpropagation. Gradients from the prediction loss $\mathcal{L}$ flow through the predictor and back into the context encoder, updating $\theta$.
The context encoder has a dual objective that it learns implicitly: it must produce representations that are (1) informative enough for the predictor to forecast target representations, and (2) abstract enough that the prediction task is feasible — filtering out unpredictable noise and retaining predictable structure.
4.2 Target Encoder $f_{\bar{\theta}}$
The target encoder has the same architecture as the context encoder but serves a fundamentally different role: it produces the prediction targets. It processes the masked (target) region of the input:
$$s_y = f_{\bar{\theta}}(x_{\text{target}}) \in \mathbb{R}^{N_t \times D}$$where $N_t$ is the number of target tokens and $\bar{\theta}$ denotes the target encoder parameters. The critical design decisions:
Exponential Moving Average (EMA) update: The target encoder parameters $\bar{\theta}$ are not trained by gradient descent. Instead, after each optimization step, they are updated as:
$$\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau) \theta$$where $\tau \in [0, 1)$ is the momentum coefficient (typically $\tau \geq 0.996$) and $\theta$ are the current context encoder parameters. This creates a slowly-evolving target that provides stable prediction targets while still tracking the improving encoder.
Stop-gradient: No gradients flow through the target encoder during backpropagation. The target representation $s_y$ is treated as a fixed target for the current optimization step. This is written as $\text{sg}(s_y)$ or $\overline{s_y}$ in the loss function. The combination of EMA and stop-gradient is essential: without them, the system would collapse to the trivial solution where both encoders output a constant, achieving zero prediction loss.
Momentum schedule: In practice (as demonstrated in later instantiations like I-JEPA), the momentum $\tau$ follows a cosine schedule from a starting value $\tau_0$ (e.g., 0.996) to 1.0 over the course of training. Early in training, faster updates ($\tau$ closer to 0.996) allow the target encoder to track the rapidly improving context encoder. Later, slower updates ($\tau$ closer to 1.0) provide more stable targets:
$$\tau_t = 1 - (1 - \tau_0) \cdot \left( \cos\left(\frac{\pi t}{T}\right) + 1 \right) / 2$$where $t$ is the current step and $T$ is the total number of training steps.
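The EMA update and the cosine momentum schedule are simple to state in code — a direct transcription of the two formulas above:

```python
import numpy as np

def momentum_schedule(t, T, tau0=0.996):
    """Cosine momentum schedule: returns tau0 at step t=0 and 1.0 at t=T."""
    return 1.0 - (1.0 - tau0) * (np.cos(np.pi * t / T) + 1.0) / 2.0

def ema_update(theta_bar, theta, tau):
    """Post-step target-encoder update: tau * theta_bar + (1 - tau) * theta."""
    return tau * theta_bar + (1.0 - tau) * theta
```

Note that the schedule is monotonically increasing, so the target encoder moves fastest at the start of training and is effectively frozen by the end.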
4.3 Predictor $g_\phi$
The predictor is the component that makes JEPA fundamentally different from both contrastive and generative methods. Given the context representation $s_x$ and information $z$ about which target positions to predict, the predictor outputs a prediction of the target representation:
$$\hat{s}_y = g_\phi(s_x, z) \in \mathbb{R}^{N_t \times D}$$The predictor must solve a non-trivial mapping problem: from the context representation at positions $\{p_1, \ldots, p_{N_c}\}$, infer what the representation would be at the target positions $\{q_1, \ldots, q_{N_t}\}$. This requires the predictor to learn spatial/temporal/semantic relationships between different parts of the input.
Architecture: LeCun's paper does not prescribe a specific predictor architecture, but the design principle is clear — the predictor should be narrow (lower capacity than the encoders). A narrow predictor creates an information bottleneck that forces the encoders to produce maximally informative representations. If the predictor were as large as the encoder, it could memorize patterns without requiring good encoder representations. Subsequent implementations (I-JEPA, V-JEPA) use narrow transformers with fewer layers and smaller hidden dimensions than the encoder.
Conditioning on target specification $z$: The predictor must know what to predict. The variable $z$ encodes this information — typically as positional embeddings for the target locations. In a vision model, $z$ might be a set of learnable mask tokens at the positions of the masked patches. In a temporal model, $z$ might encode the future time step to predict. The predictor attends to both $s_x$ (the context) and $z$ (the target specification) to produce its output.
Why the predictor matters for collapse prevention: Without a predictor (i.e., if JEPA simply minimized $\|s_x - s_y\|^2$ between context and target representations), the system would be a variant of BYOL/SimSiam and would rely entirely on the EMA asymmetry to prevent collapse. The predictor provides an additional bottleneck: the context encoder is incentivized to produce representations that contain predictive information about the target, while the predictor's limited capacity prevents it from simply copying inputs. This creates a natural information filtering mechanism.
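A minimal sketch of a conditioned predictor, assuming an MLP stand-in (real instantiations such as I-JEPA use a narrow transformer whose mask tokens cross-attend to the context; mean-pooling the context here is a simplification):

```python
import numpy as np

class NarrowPredictor:
    """MLP stand-in for the narrow predictor g_phi. The hidden width is
    deliberately smaller than the representation dim D -- this is the
    capacity bottleneck discussed above."""

    def __init__(self, D, D_pos, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(D + D_pos, hidden)) * 0.1
        self.W2 = rng.normal(size=(hidden, D)) * 0.1

    def __call__(self, s_x, z):
        # s_x: (N_c, D) context representations; z: (N_t, D_pos) position codes.
        ctx = np.broadcast_to(s_x.mean(axis=0), (z.shape[0], s_x.shape[1]))
        h = np.concatenate([ctx, z], axis=1)            # condition on s_x and z
        return np.maximum(h @ self.W1, 0.0) @ self.W2   # (N_t, D) predictions
```

The essential property — that the same context must yield different outputs for different $z$ — is preserved even in this reduced form.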
4.4 Masking Strategy
The masking strategy determines how the input is partitioned into context and target regions. This is the primary mechanism that defines the self-supervised prediction task and has an enormous impact on what representations are learned.
LeCun's position paper discusses masking at an abstract level, identifying key principles:
- The target region should be informative. Masking should remove semantically meaningful content that the model must reason about to predict. Random pixel masking is less effective than masking coherent semantic regions.
- The prediction task should be non-trivial. If the context contains almost all the information needed to trivially reconstruct the target, the model learns nothing useful. The masking ratio should be substantial.
- Multi-block masking is preferred over single-block masking. Predicting multiple separate target blocks from a context forces the model to build a more holistic understanding than predicting a single contiguous region.
The position paper argues strongly against random patch masking (as used in MAE) because adjacent patches provide sufficient local context to reconstruct individual missing patches via interpolation — the model never needs to learn high-level semantics. Block masking, where large contiguous regions are removed, forces the model to predict at a higher level of abstraction because local texture information is insufficient to solve the task.
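A multi-block sampler of the kind described above can be sketched as follows — grid and block sizes are illustrative choices; the 4-block default echoes I-JEPA rather than anything prescribed by the position paper:

```python
import numpy as np

def sample_multiblock_mask(grid=(14, 14), num_blocks=4, block=(4, 4), seed=None):
    """Sample K rectangular target blocks on a patch grid; the context is
    the complement of their union."""
    rng = np.random.default_rng(seed)
    H, W = grid
    bh, bw = block
    target = np.zeros((H, W), dtype=bool)
    for _ in range(num_blocks):                 # blocks may overlap each other
        r = rng.integers(0, H - bh + 1)
        c = rng.integers(0, W - bw + 1)
        target[r:r + bh, c:c + bw] = True
    context = ~target                           # context = everything unmasked
    return context, target
```

Because the blocks are contiguous, no target patch can be recovered by interpolating its immediate neighbors — the property the position paper argues random patch masking lacks.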
4.5 Loss Function
The JEPA loss function measures the discrepancy between predicted and actual target representations in latent space. The basic form is:
$$\mathcal{L}(\theta, \phi) = \frac{1}{N_t} \sum_{i=1}^{N_t} \left\| \hat{s}_y^{(i)} - \text{sg}\left(s_y^{(i)}\right) \right\|_2^2$$where:
- $\hat{s}_y^{(i)} = g_\phi(s_x, z)_i$ is the predicted representation at target position $i$, produced by the predictor
- $s_y^{(i)} = f_{\bar{\theta}}(x_{\text{target}})_i$ is the actual target representation at position $i$, produced by the target encoder
- $\text{sg}(\cdot)$ denotes the stop-gradient operator — gradients do not flow through this term
- $N_t$ is the number of target positions
- $\theta$ are the context encoder parameters, $\phi$ are the predictor parameters
- $\bar{\theta}$ are the target encoder parameters (EMA of $\theta$, not optimized by this loss)
The loss is minimized with respect to $\theta$ and $\phi$ jointly. Gradients flow through $\hat{s}_y$ back to both the predictor $g_\phi$ and the context encoder $f_\theta$. The target representations $s_y$ are treated as fixed targets at each step.
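The loss and its gradient can be written out directly — the stop-gradient shows up as the absence of any gradient term for $s_y$ (a numpy sketch with a manual derivative):

```python
import numpy as np

def jepa_loss(s_y_hat, s_y):
    """Per-position L2 loss in latent space. s_y is treated as a constant
    (stop-gradient), so only the gradient w.r.t. s_y_hat is returned."""
    N_t = s_y_hat.shape[0]
    diff = s_y_hat - s_y                        # s_y: fixed target this step
    loss = np.mean(np.sum(diff ** 2, axis=1))   # (1/N_t) sum_i ||.||_2^2
    grad_s_y_hat = 2.0 * diff / N_t             # no gradient flows to s_y
    return loss, grad_s_y_hat
```

In an autograd framework the same effect is obtained by detaching $s_y$ from the computation graph before the subtraction.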
Why L2 loss works here (but wouldn't for pixel prediction): In pixel space, L2 loss produces blurry predictions because it averages over all possible reconstructions. In latent space, this problem is dramatically reduced: the encoder has already discarded the unpredictable, stochastic details. What remains in the latent space is precisely the deterministic, predictable content — and for this content, L2 is a reasonable distance measure. The encoder is implicitly trained to retain only what is consistently predictable, which is exactly the regime in which L2 is an appropriate loss.
Why this loss prevents collapse (analysis): The most dangerous failure mode for JEPA is representational collapse — where the encoders map all inputs to the same constant vector, trivially achieving zero loss. JEPA prevents this through three complementary mechanisms:
- EMA target encoder + stop-gradient: The target encoder evolves slowly and doesn't receive gradients. The context encoder cannot "negotiate" with the target encoder to jointly collapse — it must adapt to whatever the target encoder produces. Since the target encoder started from a random initialization and evolves slowly, its outputs remain diverse for long enough that the context encoder learns useful features.
- Predictor bottleneck: The predictor is deliberately narrow. If the context encoder produced constant representations, the narrow predictor could not map them to the diverse target representations that the EMA-lagged target encoder still produces. The predictor's limited capacity creates a pressure for the context encoder to produce informative inputs.
- Multi-target prediction: Predicting representations at multiple target positions simultaneously is harder to collapse than predicting a single target. The model must produce position-dependent predictions, which requires non-trivial encoding of the context.
4.6 Energy-Based Model Perspective
LeCun frames JEPA within the broader context of Energy-Based Models (EBMs). An EBM defines a scalar energy function $E(x, y)$ over input pairs, where low energy indicates compatibility and high energy indicates incompatibility. JEPA's energy function is:
$$E_w(x, y) = \left\| f_{\bar{\theta}}(y) - g_\phi\left(f_\theta(x), z\right) \right\|^2$$where $w = (\theta, \phi, \bar{\theta})$ are all the model parameters. This energy is low when the predicted representation of $y$ from context $x$ matches the actual representation of $y$.
The fundamental challenge in EBM training is shaping the energy landscape so that compatible pairs have low energy while incompatible pairs have high energy. There are four principal strategies:
- Contrastive methods: Explicitly push up the energy of negative (incompatible) pairs while pulling down the energy of positive pairs. Examples: SimCLR, MoCo, InfoNCE.
- Regularization methods: Add regularization terms that prevent the energy surface from becoming flat (collapsed). Examples: VICReg, Barlow Twins.
- Architectural methods: Design the architecture so that collapse is structurally discouraged. Examples: JEPA (asymmetric EMA + predictor bottleneck), BYOL.
- Generative/reconstruction methods: Train the model to minimize energy only for observed data, relying on the reconstruction objective to implicitly raise energy for unobserved data. Examples: VAE, MAE.
JEPA relies primarily on the architectural strategy (option 3), optionally supplemented by regularization (option 2). LeCun argues in the paper that this is preferable to contrastive methods because (a) constructing good negative samples is difficult, especially in structured prediction tasks, and (b) contrastive methods have an implicit mode-covering behavior that can lead to overly broad, low-information representations.
4.7 Information Filtering and the Role of Latent Prediction
A distinctive and theoretically important property of JEPA is its implicit information filtering. This emerges naturally from the architecture without any explicit information bottleneck, dropout, or capacity constraint on the encoder (though the predictor bottleneck helps).
Consider what happens during training: the encoder $f_\theta$ learns to map inputs to a representation space where the prediction loss is minimized. Any information in the input that is not predictable from context — random texture variations, sensor noise, precise lighting conditions, exact pixel values — will not help reduce the prediction loss. Including such information in the representation would actually increase the prediction loss, because the predictor would be penalized for failing to predict these unpredictable details. Therefore, the encoder is implicitly trained to discard unpredictable information and retain only the predictable, semantic content.
In information-theoretic terms, because $s_y$ is a deterministic function of $x_{\text{target}}$, the representation can carry no more information than the input itself: $$I(s_y; x_{\text{target}}) \leq I(x_{\text{target}}; x_{\text{target}}) = H(x_{\text{target}})$$where $I(\cdot;\cdot)$ denotes mutual information and $H(\cdot)$ denotes entropy. The encoder compresses the input, and the compression is guided by what is predictable from context. Formally, the encoder learns representations that maximize:
$$I(s_y; s_x) \quad \text{subject to} \quad s_y = f_{\bar\theta}(x_\text{target})$$This is the information-theoretic dual of the observation that generative models waste capacity on unpredictable details: JEPA automatically learns to ignore them.
5. Implementation Details
The JEPA position paper by LeCun (2022) is a theoretical framework paper, not an empirical methods paper. It provides no specific hyperparameters, benchmark numbers, or implementation, and no public repository exists. Concrete instantiations of JEPA with full implementation details came later: I-JEPA (Assran et al., 2023) for images and V-JEPA (Bardes et al., 2024) for video.
However, the position paper does establish architectural principles that constrain any valid implementation. We present these as a reference table, noting which values come from the paper's discussion versus later instantiations.
| Hyperparameter | Value / Recommendation | Source & Notes |
|---|---|---|
| Context Encoder | Deep network (transformer recommended) | Position paper; specific architecture deferred to instantiations |
| Target Encoder | Same architecture as context encoder | Position paper; EMA-updated copy |
| Predictor | Narrow network (fewer layers, smaller hidden dim) | Position paper principle; I-JEPA uses 12-layer narrow transformer |
| Predictor capacity ratio | ~1/4 to 1/2 of encoder capacity | Implicit from position paper discussion; I-JEPA uses 384-dim predictor vs 1024-dim encoder for ViT-L |
| EMA momentum $\tau$ | $\geq 0.996$, schedule to $1.0$ | Not specified in position paper; I-JEPA uses cosine schedule $0.996 \to 1.0$ |
| Masking strategy | Block masking (multi-block preferred) | Position paper; I-JEPA uses 4 target blocks, 85% mask ratio |
| Loss function | L2 (MSE) in latent space | Position paper; later variants explore smooth-L1, cosine similarity |
| Optimizer | Not specified | I-JEPA uses AdamW with $\beta_1=0.9$, $\beta_2=0.95$ |
| Learning rate | Not specified | I-JEPA uses $1.5 \times 10^{-4}$ with cosine decay |
| Batch size | Not specified | I-JEPA uses 2048 |
| Warmup | Not specified | I-JEPA uses 15 epochs |
| Training epochs | Not specified | I-JEPA uses 600 epochs on ImageNet-1K |
| Data augmentation | Minimal to none (by design) | Position paper; JEPA should not require hand-crafted augmentations |
| Normalization | Layer normalization on targets | Common in implementations; prevents loss scale issues |
| Weight decay | Not specified | I-JEPA uses 0.05 |
| Mixed precision | Not specified | I-JEPA uses bf16 for efficiency |
| GPU requirements | Not specified | I-JEPA: 16 A100 GPUs, ~72h for ViT-H/14 on IN-1K |
6. Algorithm
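The position paper gives no pseudocode, so the following is a hedged sketch of one training iteration using toy linear encoders and a pooled-context predictor — the function name and all dimensions are illustrative, and Section 7 describes the same steps in prose:

```python
import numpy as np

def jepa_train_step(theta, theta_bar, W_pred, tokens, ctx_idx, tgt_idx, z,
                    tau=0.996, lr=1e-3):
    """One JEPA iteration with toy linear encoders/predictor: encode context,
    encode target without gradients, predict, take a gradient step on theta
    and W_pred, then EMA-update theta_bar in place."""
    D = theta.shape[1]
    s_x = tokens[ctx_idx] @ theta               # context encoding (trainable)
    s_y = tokens[tgt_idx] @ theta_bar           # target encoding (no gradient)
    pooled = np.broadcast_to(s_x.mean(0), (len(tgt_idx), D))
    inp = np.concatenate([pooled, z], axis=1)   # condition on context and z
    s_y_hat = inp @ W_pred                      # predictor output
    diff = s_y_hat - s_y
    loss = np.mean(np.sum(diff ** 2, axis=1))

    # Manual gradients; the stop-gradient means no term involves theta_bar.
    g_out = 2.0 * diff / len(tgt_idx)           # dL/d s_y_hat
    g_inp = g_out @ W_pred.T
    g_pool = g_inp[:, :D].sum(axis=0)           # back through the broadcast
    g_s_x = np.broadcast_to(g_pool / len(ctx_idx), s_x.shape)
    W_pred -= lr * (inp.T @ g_out)              # update predictor
    theta -= lr * (tokens[ctx_idx].T @ g_s_x)   # update context encoder
    theta_bar *= tau                            # EMA update of target encoder
    theta_bar += (1.0 - tau) * theta
    return loss
```

The structure — forward on two paths, backward on one, EMA on the other — is the invariant that any faithful implementation must preserve, whatever the encoder architecture.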
7. Training
A single JEPA training iteration proceeds through the following steps:
- Sample a batch of $B$ input samples $\{x_1, \ldots, x_B\}$ from the dataset. No labels are used.
- Generate masks. For each sample, use the block masking strategy (Algorithm 2) to define $K$ target regions $\{\mathcal{T}_1, \ldots, \mathcal{T}_K\}$ and the complementary context region $\mathcal{C}$. The masks are sampled stochastically — different samples in the batch have different masks, and different training steps use different masks for the same sample.
- Tokenize/patchify the input. The raw input $x$ is converted to a sequence of tokens (e.g., for images, non-overlapping patches are flattened and linearly projected; for audio, spectral frames are embedded). Positional embeddings are added.
- Separate context and target tokens. Using the mask, partition the tokens into context tokens (at positions in $\mathcal{C}$) and target tokens (at positions in $\mathcal{T}_1 \cup \cdots \cup \mathcal{T}_K$).
- Context encoding. Feed the context tokens through the context encoder $f_\theta$ to obtain context representations $s_x \in \mathbb{R}^{N_c \times D}$. This is the primary trainable path.
- Target encoding. Feed the target tokens through the target encoder $f_{\bar{\theta}}$ to obtain target representations $s_y \in \mathbb{R}^{N_t \times D}$. Apply stop-gradient: no gradient is computed for this path.
- Prediction. The predictor $g_\phi$ receives the context representations $s_x$ and positional specifications $z_k$ for each target block, and produces predicted target representations $\hat{s}_{y_k}$ for each block.
- Loss computation. Compute $\mathcal{L} = \frac{1}{K}\sum_k \frac{1}{N_{t_k}} \|\hat{s}_{y_k} - \text{sg}(s_{y_k})\|^2$.
- Backpropagation. Compute gradients $\nabla_\theta \mathcal{L}$ and $\nabla_\phi \mathcal{L}$ and update the context encoder and predictor parameters.
- EMA update. Update the target encoder: $\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau)\theta$.
Mathematical formulation of the training objective. The full JEPA training objective, aggregated over a dataset $\mathcal{D}$ and stochastic masks, is:
$$\min_{\theta, \phi} \; \mathbb{E}_{x \sim \mathcal{D}} \; \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})} \left[ \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_{t_k}} \sum_{i=1}^{N_{t_k}} \left\| g_\phi\bigl(f_\theta(x_\mathcal{C}), z_k\bigr)_i - \text{sg}\bigl(f_{\bar{\theta}}(x_{\mathcal{T}_k})\bigr)_i \right\|_2^2 \right]$$where $\mathcal{M}$ denotes the stochastic mask (which determines $\mathcal{C}$ and $\{\mathcal{T}_k\}$), $p(\mathcal{M})$ is the masking distribution (Algorithm 2), and $\bar{\theta}$ is updated via EMA after each optimization step (not optimized by this loss).
8. Inference
After pretraining, JEPA produces a feature extractor that can be deployed for downstream tasks. The inference procedure is substantially simpler than training.
Which encoder is kept? The target encoder $f_{\bar{\theta}}$ is typically used for downstream tasks, as it represents a smoothed version of the context encoder that has been shown empirically to produce slightly better features (analogous to BYOL/MoCo). Some implementations evaluate both and select based on downstream performance, but the target encoder is the default choice.
The predictor is discarded. The predictor $g_\phi$ served its purpose during training — it created the learning signal that drove the encoder to produce good representations. At inference time, only the encoder is needed.
Feature extraction: Given a new input $x$, the full (unmasked) input is fed through the kept encoder to produce a representation:
$$s = f_{\bar{\theta}}(x) \in \mathbb{R}^{N \times D}$$where $N$ is the number of tokens for the full input and $D$ is the representation dimension. Depending on the downstream task, features can be aggregated:
- Global average pooling: $\bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_i \in \mathbb{R}^D$ — for classification
- CLS token: $s_{\text{CLS}} \in \mathbb{R}^D$ — if the encoder uses a CLS token
- Token-level features: $s \in \mathbb{R}^{N \times D}$ — for dense prediction tasks (segmentation, detection)
- Multi-layer features: Concatenation of intermediate layer outputs — for tasks requiring multi-scale features
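The first and third aggregation modes can be sketched directly; the CLS and multi-layer options depend on encoder specifics and are omitted here:

```python
import numpy as np

def aggregate_features(s, mode="mean"):
    """Aggregate token-level features s of shape (N, D) for downstream use."""
    if mode == "mean":
        return s.mean(axis=0)   # global average pooling -> (D,) for classification
    if mode == "tokens":
        return s                # dense (N, D) map for segmentation/detection
    raise ValueError(f"unknown mode: {mode}")
```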
Evaluation protocols:
- Linear probing: Train a single linear layer $W \in \mathbb{R}^{C \times D}$ on top of frozen encoder features. This evaluates the linear separability of the learned representations and is the standard benchmark for self-supervised methods.
- Attentive probing: A lightweight cross-attention module that attends to the full token sequence $s \in \mathbb{R}^{N \times D}$. This evaluates whether fine-grained spatial information is preserved in the representations.
- Fine-tuning: Unfreezing all encoder parameters and training end-to-end on the downstream task with a task-specific head. This evaluates the encoder's utility as an initialization.
- $k$-NN evaluation: Nearest-neighbor classification in the representation space without any training. This evaluates the raw geometric structure of the latent space.
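Of these protocols, $k$-NN evaluation is the simplest to sketch, since it needs no training: classify each query by majority vote among its nearest frozen-encoder features. Cosine similarity is a common choice, though details vary across implementations:

```python
import numpy as np

def knn_classify(train_feats, train_labels, query_feats, k=5):
    """k-NN classification in the frozen representation space."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = b @ a.T                            # (Q, N) cosine similarities
    nn = np.argsort(-sims, axis=1)[:, :k]     # k nearest neighbors per query
    votes = train_labels[nn]                  # (Q, k) neighbor labels
    return np.array([np.bincount(v).argmax() for v in votes])
```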
9. Results & Benchmarks
The JEPA position paper (LeCun, 2022) is a theoretical contribution — it does not contain empirical benchmarks or experimental results. The paper's contribution is the architecture and the argument for why latent prediction should outperform pixel prediction and contrastive learning. Empirical validation came through subsequent instantiations.
However, to ground the JEPA framework in concrete numbers, we present results from I-JEPA (Assran et al., CVPR 2023), the first direct realization of the JEPA principles for images, which closely follows the framework described in the position paper:
9.1 ImageNet-1K Classification (I-JEPA Results)
| Method | Architecture | Pretraining Data | Linear Probe (Top-1 %) | Approach |
|---|---|---|---|---|
| I-JEPA | ViT-H/14 (632M) | IN-1K | 77.3 | JEPA (latent prediction) |
| MAE | ViT-H/14 (632M) | IN-1K | 76.0 | Generative (pixel prediction) |
| data2vec v2 | ViT-H/14 (632M) | IN-1K | 76.3 | Latent prediction + multi-mask |
| iBOT | ViT-L/16 (307M) | IN-1K | 75.4 | Contrastive + MIM |
| DINO | ViT-L/16 (307M) | IN-1K | 76.1 | Contrastive (self-distillation) |
| MoCo v3 | ViT-L/16 (307M) | IN-1K | 73.4 | Contrastive (momentum) |
9.2 Low-Shot and Transfer (I-JEPA Results)
| Method | 1% Labels (Top-1) | CIFAR-100 (Linear) | Places205 (Linear) |
|---|---|---|---|
| I-JEPA (ViT-H/14) | 70.5 | 84.5 | 59.2 |
| MAE (ViT-H/14) | 64.5 | 80.5 | 57.9 |
| DINO (ViT-L/16) | 69.8 | 83.1 | 58.6 |
9.3 Computational Efficiency (I-JEPA Results)
A critical practical advantage predicted by LeCun's position paper is that JEPA should be more computationally efficient than methods that require processing of all tokens. I-JEPA confirmed this:
| Method | Architecture | GPU Hours (16×A100) | IN-1K Linear (%) | Throughput Ratio |
|---|---|---|---|---|
| I-JEPA | ViT-H/14 | ~1150 | 77.3 | 1.0× (reference) |
| MAE | ViT-H/14 | ~1600 | 76.0 | 0.72× |
| iBOT | ViT-L/16 | ~3800 | 75.4 | 0.30× |
| DINO v2 | ViT-L/16 | ~4000+ | 76.1 | 0.29× |
The efficiency gain comes from two sources: (1) the context encoder only processes $\sim$15% of the tokens (the unmasked context), and (2) no data augmentation is required, eliminating the cost of multi-crop augmentation used by contrastive methods.
9.4 Key Ablations from I-JEPA
The following ablation studies from I-JEPA (Assran et al., 2023) validate specific architectural decisions from the JEPA framework:
| Ablation | Modification | IN-1K Linear (%) | Δ vs Default |
|---|---|---|---|
| Default I-JEPA | — | 73.4 (ViT-L) | — |
| Random masking | Random patches instead of blocks | 69.1 | −4.3 |
| Single target block | $K=1$ instead of $K=4$ | 71.8 | −1.6 |
| Pixel reconstruction loss | Predict pixels instead of latent | 68.2 | −5.2 |
| No predictor (direct match) | L2 between encoded context and target | collapse | — (total failure) |
| Wide predictor | Same dim as encoder | 71.0 | −2.4 |
| No EMA (both trained) | Both encoders receive gradients | collapse | — (total failure) |
These ablations validate every major architectural decision proposed in LeCun's position paper: block masking outperforms random masking, latent prediction outperforms pixel prediction, the predictor is necessary and should be narrow, and the EMA target encoder is essential for preventing collapse.
10. Connection to the JEPA Family
The JEPA position paper is the origin point of the entire JEPA family of architectures. It occupies a unique position: it is not an empirical methods paper with specific implementation details, but rather the theoretical and philosophical foundation from which all subsequent JEPA variants derive.
The JEPA Lineage
The ideas in the position paper were instantiated and extended by a sequence of later works: I-JEPA for images, V-JEPA for video, A-JEPA for audio, and MC-JEPA for multimodal learning, among others.
What later variants borrowed from the position paper:
- The asymmetric encoder-predictor-target architecture: Every JEPA variant uses this fundamental three-component structure with EMA target encoder and stop-gradient.
- Latent space prediction: No JEPA variant predicts raw input data. The entire family operates in learned latent spaces, as prescribed by the position paper.
- Block masking over random masking: The position paper's argument against random masking influenced all subsequent designs. I-JEPA uses multi-block, V-JEPA uses spatiotemporal tubes, A-JEPA uses frequency-temporal blocks.
- Minimal data augmentation: The position paper argued that JEPA should not rely on hand-crafted augmentations. I-JEPA achieved strong results with no augmentation beyond basic resizing — a stark contrast to contrastive methods that require multi-crop, color jitter, and other augmentations.
What is genuinely novel in the position paper:
The position paper's contribution is not a single technical trick but a coherent architectural philosophy grounded in energy-based model theory and cognitive science intuitions. Specifically:
- The argument that prediction should happen in latent space, not input space. While prior methods (BYOL, SimSiam) operated in latent space, they used global invariance objectives, not structured prediction. JEPA is the first framework to articulate why latent prediction is superior: it enables implicit information filtering, avoiding the waste of modeling capacity on unpredictable details.
- The predictor as a core architectural component for collapse prevention and structured reasoning. Prior self-distillation methods treated the projection head as an afterthought. JEPA positions the predictor as a first-class component that (a) captures the functional relationship between input regions, (b) provides an information bottleneck for collapse prevention, and (c) enables the model to learn about the structure of the world (spatial, temporal, causal relationships).
- The vision of JEPA as a world model for planning. The paper goes beyond representation learning to propose that JEPA-style predictive models, when conditioned on actions, can serve as world models for hierarchical planning — a vision that connects self-supervised learning to the broader goal of autonomous machine intelligence.
The world model vision: Perhaps the most ambitious aspect of the position paper is its proposal that JEPA is not merely a pretraining method but a world model architecture. LeCun envisions a hierarchical JEPA where:
- A low-level JEPA predicts short-term sensory representations from recent context.
- A mid-level JEPA predicts abstract state representations at longer time scales.
- A high-level JEPA predicts goal-relevant features over extended horizons.
- An action-conditioned variant predicts the consequences of actions, enabling planning: $\hat{s}_{t+1} = g_\phi(s_t, a_t)$ where $a_t$ is the action taken at time $t$.
This hierarchical world model would enable an agent to plan by simulating the consequences of action sequences in latent space — without ever predicting or rendering actual sensory observations. This vision has begun to be realized in works on action-conditioned video prediction and model-based reinforcement learning using JEPA-style architectures.
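The planning loop implied by $\hat{s}_{t+1} = g_\phi(s_t, a_t)$ can be sketched with a simple random-shooting planner. Everything below is a toy illustration: the linear dynamics stand in for a learned JEPA predictor, and the goal is given directly as a latent vector rather than produced by a higher-level JEPA:

```python
import numpy as np

# Toy sketch of planning in latent space with an action-conditioned predictor.
# A real system would use a learned JEPA predictor g_phi; the linear dynamics
# and quadratic cost here are stand-ins for illustration only.
rng = np.random.default_rng(0)

def g(s, a):
    """Predicted next latent state: s_{t+1} = g(s_t, a_t) (toy dynamics)."""
    return 0.9 * s + a

def plan(s0, goal, horizon=5, n_candidates=256):
    """Random-shooting planner: simulate candidate action sequences entirely
    in latent space and return the first action of the sequence whose final
    state lands nearest the goal representation. No sensory observation is
    ever predicted or rendered."""
    best_cost, best_first_action = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, s0.shape[0]))
        s = s0
        for a in actions:
            s = g(s, a)                     # roll out in latent space
        cost = np.linalg.norm(s - goal)     # distance to goal representation
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action, best_cost

s0, goal = np.zeros(4), np.ones(4)
action, cost = plan(s0, goal)
print(cost)  # best candidate's final-state distance to the goal
```

More capable planners (CEM, gradient-based trajectory optimization) replace the random search, but the structure is the same: the cost of an action sequence is evaluated by rolling the predictor forward in representation space.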
11. Comparison with Contrastive and Generative Approaches
The position paper devotes substantial discussion to contrasting JEPA with its two main alternatives: contrastive learning and generative (reconstructive) learning. Understanding these distinctions is essential for understanding JEPA's place in the landscape.
JEPA vs. Generative (MAE, diffusion models): Generative methods predict the raw input — pixels, waveforms, tokens. This is problematic for two reasons. First, raw input space contains massive amounts of unpredictable information (exact textures, noise, stochastic details). The model must spend capacity modeling this irrelevant information, because the loss penalizes any discrepancy. Second, when uncertainty is high (the model is unsure which of several plausible completions is correct), L2 loss in pixel space produces the average of all possibilities — a blurry, unrealistic result. JEPA avoids both problems: the encoder filters unpredictable information before prediction, and L2 loss in the filtered latent space corresponds to predicting the common semantic content shared by all plausible completions.
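The mode-averaging pathology of L2 loss can be verified with a two-line computation. The "completions" below are toy arrays standing in for two equally plausible pixel-space reconstructions:

```python
import numpy as np

# Toy demonstration: under L2 loss, the optimal single prediction for a
# multimodal target is the MEAN of the modes. In pixel space that mean is a
# blurry image matching neither plausible completion.
completion_a = np.array([1.0, 0.0, 1.0, 0.0])   # one plausible completion
completion_b = np.array([0.0, 1.0, 0.0, 1.0])   # another, equally likely

# Expected L2 loss of a prediction p is 0.5*||p - a||^2 + 0.5*||p - b||^2,
# which is minimized at the average of the two modes:
l2_optimal = 0.5 * (completion_a + completion_b)
print(l2_optimal)   # [0.5 0.5 0.5 0.5] -- unlike either mode

# If an encoder first maps both completions to the same latent code
# (filtering out the unpredictable which-mode detail), the latent-space L2
# target is a single sharp point and no averaging artifact arises.
```

This is precisely why JEPA's L2 objective is benign in latent space while the same objective is harmful in pixel space: the encoder removes the multimodality before the loss ever sees it.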
JEPA vs. Contrastive (SimCLR, MoCo, DINO): Contrastive methods learn an energy function that distinguishes compatible pairs from incompatible ones. They require either explicit negative samples (SimCLR, MoCo) or architectural tricks (BYOL, SimSiam) to prevent collapse. The key limitation is that contrastive methods learn an invariance — the embedding of one augmented view should match the embedding of another augmented view of the same input. This is a much simpler task than JEPA's prediction task: contrastive methods learn that "these two views are compatible" while JEPA learns "given this context, the missing content should have this representation." JEPA's prediction captures richer structural information about the relationships between parts of the input.
JEPA vs. BYOL/SimSiam (non-contrastive joint-embedding): BYOL and SimSiam also use asymmetric architectures without negatives, but they operate on global representations — the entire input is encoded into a single vector, and the objective is invariance between augmented views. JEPA operates on structured, position-dependent representations and performs spatially-structured prediction. This is a strictly harder task that yields representations with richer spatial and relational structure.
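The structural difference can be made concrete with a small sketch. All shapes and the toy "predictor" below are illustrative, not any paper's actual implementation; the point is the shape of the objective, not its contents:

```python
import numpy as np

# Structural sketch of a JEPA-style objective. BYOL/SimSiam match ONE global
# vector per view; JEPA instead predicts a separate latent for EACH masked
# position from the visible context. (Toy predictor; shapes illustrative.)
rng = np.random.default_rng(0)
num_tokens, dim = 16, 8

target_feats = rng.normal(size=(num_tokens, dim))  # stand-in target-encoder output
mask = np.zeros(num_tokens, dtype=bool)
mask[4:8] = True                                   # one contiguous masked block

# Toy predictor: each masked position gets its own prediction, conditioned on
# a summary of the visible context plus a position-dependent term.
context = target_feats[~mask].mean(axis=0)         # summary of visible tokens
pos_embed = rng.normal(size=(mask.sum(), dim))     # per-position conditioning
predictions = context + 0.1 * pos_embed            # shape (4, dim): one per position

# The loss is L2 in latent space, ONLY at masked positions, against the
# stop-gradient target features: position-wise prediction, not a single
# global invariance constraint between two augmented views.
loss = np.mean((predictions - target_feats[mask]) ** 2)
print(predictions.shape, round(float(loss), 3))
```

A BYOL-style objective would collapse this to one comparison between two pooled global vectors; the per-position targets are what force JEPA representations to retain spatial and relational structure.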
12. Summary
Main Contribution: A coherent theoretical framework — grounded in energy-based model theory — that unifies the encoder-predictor-target architecture, block masking, EMA target encoders, and predictor bottlenecks into a single, modality-agnostic blueprint for self-supervised learning. The framework's validity was subsequently demonstrated by I-JEPA, V-JEPA, A-JEPA, and numerous other instantiations that matched or exceeded the performance of both generative and contrastive methods across images, video, audio, and multimodal domains.
When to use JEPA vs. alternatives:
- Use JEPA when you want efficient pretraining without hand-crafted augmentations, when you need representations that capture semantic structure (not pixel-level detail), when working in domains where constructing good negatives is difficult, or when aiming to build world models that predict in latent space.
- Consider contrastive methods when you have well-understood augmentations for your domain and want the simplest possible training setup with minimal collapse risk.
- Consider generative methods when you need the model to also generate/reconstruct data (e.g., for image synthesis), or when you operate in a domain where all input details are relevant (e.g., lossless compression).
13. References
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview preprint, Version 0.9.2.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ECCV 2024. (V-JEPA)
- Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z. D., Azar, M. G., et al. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. (BYOL)
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022. (MAE)
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020. (SimCLR)
- He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020. (MoCo)
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021. (DINO)
- Chen, X. & He, K. (2021). Exploring Simple Siamese Representation Learning. CVPR 2021. (SimSiam)
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. (ViT)
- Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. ICML 2022.
- Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., & Kong, T. (2022). iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR 2022.
- Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021.
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.