Authors: Mur-Labadia, Muckley, Bar, Assran, Sinha, Rabbat, LeCun, Ballas, Bardes
Date: 2026-03
Category: Physics / World Models
Derives from: V-JEPA 2

1. Introduction

Self-supervised learning from video has advanced rapidly through the Joint-Embedding Predictive Architecture (JEPA) family. V-JEPA (Bardes et al., 2024) demonstrated that predicting masked spatiotemporal regions in latent space—rather than reconstructing raw pixels—yields powerful video representations for action recognition and temporal reasoning. V-JEPA 2 (Bardes et al., 2025) scaled this paradigm to combined image-and-video pretraining with larger Vision Transformer backbones, achieving strong performance on global classification benchmarks. However, a persistent limitation remained: while V-JEPA 2 excels at holistic scene understanding (e.g., classifying an action or identifying a scene category), its learned representations lack the spatial grounding necessary for dense prediction tasks such as object detection, instance segmentation, and semantic segmentation. The features, aggregated through global pooling or attentive probes, lose the fine-grained per-patch spatial information that dense tasks demand.

V-JEPA 2.1 (Mur-Labadia, Muckley, Bar, Assran, Sinha, Rabbat, LeCun, Ballas, & Bardes, 2026) directly addresses this gap. The core insight is that the standard JEPA prediction loss—which evaluates reconstruction quality at the level of masked block regions pooled or averaged over target tokens—does not explicitly incentivize the encoder to maintain spatially precise representations at each individual patch token. V-JEPA 2.1 introduces a dense predictive loss that operates at the per-token level: the predictor must reconstruct every individual target token with high fidelity, not merely match an aggregate representation of the masked region. This seemingly simple modification has profound consequences for the spatial quality of learned features.

The key contributions of V-JEPA 2.1, as described in the paper, are:

  • Dense prediction loss design. A token-level prediction objective that complements the standard JEPA block-level loss, forcing the encoder to retain patch-level spatial information throughout the network.
  • Combined image and video training with dense objectives. The dense loss is applied uniformly across both image and video inputs during pretraining, enabling spatial grounding to transfer from the high-resolution spatial detail in images to the spatiotemporal structure of video.
  • Spatial grounding from dense features. The resulting encoder produces features that are directly useful for object detection (e.g., with ViTDet-style heads) and segmentation (e.g., with linear or lightweight decoders), without requiring task-specific architectural modifications.
  • Pixel-level understanding from JEPA. V-JEPA 2.1 demonstrates that the JEPA framework—which by design never reconstructs pixels—can nonetheless produce representations competitive with pixel-reconstruction methods (e.g., MAE, VideoMAE) on dense prediction benchmarks, while retaining the superior semantic quality of latent prediction for classification tasks.

Compared to V-JEPA 2, the architectural changes are minimal: the same encoder, target encoder, and predictor backbone are reused. The critical difference lies in the loss computation. V-JEPA 2 evaluates prediction quality over masked blocks, which may allow spatial information to "wash out" across tokens within a block. V-JEPA 2.1 evaluates prediction quality per token, creating a direct gradient signal that rewards spatially precise representations at every position in the feature map. The result is a model that simultaneously achieves strong performance on global tasks (action recognition, image classification) and dense tasks (detection, segmentation)—a combination that prior JEPA variants could not achieve without sacrificing one axis of performance.

2. Method

Intuitive Overview

Consider a jigsaw puzzle analogy. In V-JEPA 2, you are given a partially completed puzzle and asked: "Does this missing region show a dog or a cat?" You can answer correctly by recognizing coarse patterns—fur color, ear shape, background context. But you do not need to know the exact placement of each individual piece within the region. In V-JEPA 2.1, the question changes: "For each missing piece, tell me exactly what appears on it—the precise texture, edge, and content." Now you must internalize fine-grained spatial details, not just the gist.

Key Intuition: V-JEPA 2.1 replaces the "describe the missing region" task with a "describe each missing piece individually" task. By demanding per-token precision, the encoder is forced to encode spatially grounded features at every patch position, not just globally informative features that lose spatial specificity.

The method retains the core JEPA structure: an online encoder processes visible (unmasked) tokens from an image or video, a predictor takes these visible representations and positional information about the masked locations to predict representations of the masked tokens, and a target encoder (updated via exponential moving average of the online encoder) provides the prediction targets. The loss measures the discrepancy between predictions and targets.

The fundamental methodological change in V-JEPA 2.1 is where and how this discrepancy is measured:

  • V-JEPA 2 (block-level loss): The predictor outputs representations for the masked region. The loss is computed after some form of aggregation (e.g., averaging tokens within a target block), or the per-token losses within a block are weighted in a way that emphasizes block-level coherence. The gradient signal per individual token is diluted.
  • V-JEPA 2.1 (dense token-level loss): The predictor outputs a representation for each individual masked token. The loss is computed independently at every masked token position. Each token receives a direct, unattenuated gradient signal demanding that its predicted representation match the target encoder's representation at that exact spatial (and temporal) location.

Why does this matter? In a standard ViT, each patch token carries information about a specific spatial location (e.g., a 16×16 pixel region). If the loss only evaluates predictions at the block level, the encoder can learn features where individual tokens drift toward encoding global context rather than local content—this is fine for classification but harmful for detection and segmentation. The dense loss anchors each token to its spatial position.
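To see why loss granularity matters, here is an illustrative numpy sketch (not the paper's code): two tokens whose prediction errors are equal and opposite look perfect to a block-averaged loss, but are fully penalized by the per-token loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                     # toy embedding dimension
targets = rng.normal(size=(2, D))         # two target tokens in one block

# Construct predictions whose per-token errors cancel within the block:
err = rng.normal(size=(1, D))
preds = targets + np.vstack([err, -err])  # token 0 off by +err, token 1 by -err

# Block-level loss: compare block averages; opposite errors cancel.
block_loss = np.mean((preds.mean(axis=0) - targets.mean(axis=0)) ** 2)

# Dense loss: independent per-token losses; errors cannot cancel.
dense_loss = np.mean((preds - targets) ** 2)

print(block_loss, dense_loss)  # block loss ~0, dense loss > 0
```

The block-averaged objective is blind to this error pattern, while the dense objective is not; this is the "diluted gradient" failure mode in miniature.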

A second important aspect of V-JEPA 2.1 is the joint image-video training with the dense objective. Images provide high spatial resolution and diversity of objects and scenes. Videos provide temporal structure and motion. By applying the dense prediction loss to both modalities, the model learns spatially grounded features that generalize across static and dynamic visual content. The image training particularly benefits dense tasks because images often contain more diverse object configurations per sample than individual video frames.

The method does not introduce new masking strategies beyond those used in V-JEPA 2. The masking operates in the spatiotemporal token space: for video, blocks of tokens spanning spatial and temporal dimensions are masked; for images, spatial blocks are masked. The distinction is purely in how the prediction quality is evaluated—not in what is masked or how masking is applied.

3. Model Overview

At-a-Glance

Property | V-JEPA 2.1
Input | Images (224×224 or 448×448) and video clips (e.g., 16 frames at 224×224)
Masking | Multi-block spatiotemporal masking (same strategy as V-JEPA 2); ~75–90% mask ratio
Encoder | Vision Transformer (ViT-H/16 or ViT-G/14); processes only visible (unmasked) tokens
Target Encoder | Same architecture as encoder; weights updated via EMA (no gradient)
Predictor | Narrow Transformer (e.g., 12 layers, reduced embedding dim); takes visible tokens + mask token placeholders → predicts each masked token
Loss | Dense token-level smooth-$\ell_1$ or $\ell_2$ loss between predicted and target representations at every masked position
Key Result | Competitive with supervised and pixel-reconstruction methods on COCO detection/segmentation while retaining strong Kinetics-400/600 action recognition accuracy
Parameters | ~630M (ViT-H encoder) or ~1.1B+ (ViT-G encoder); predictor adds ~50–100M

Training Architecture Diagram

[Diagram: image/video (B×C×T×H×W) → patchify + embed (B×N×D) → multi-block mask split. The online encoder f_θ (trainable) encodes the visible tokens (B×N_v×D); the target encoder f_ξ (EMA: ξ ← τξ + (1−τ)θ, no gradient) encodes all tokens (B×N×D). The predictor g_ϕ maps visible reps + mask positions to predictions (B×N_m×D), matched to the selected masked-token targets by the dense token-level loss ℓ(ŝ_i, s_i).]
Figure 1. V-JEPA 2.1 training architecture. The online encoder (trainable) processes only visible tokens. The predictor (trainable) takes visible representations and mask-position embeddings to predict each individual masked token. The target encoder (EMA, no gradient) processes the full token set and provides per-token targets. The dense loss is computed independently at every masked token position, creating direct spatial gradient signals. Solid-bordered boxes are trainable; dashed borders indicate frozen/EMA components.

4. Main Components of V-JEPA 2.1

4.1 Encoder ($f_\theta$)

WHAT: The encoder is a standard Vision Transformer (ViT) that processes only the visible (unmasked) patch tokens from the input image or video. It maps visible tokens to a sequence of $D$-dimensional representations that capture both local content and global context.

HOW: The encoder follows the ViT architecture (Dosovitskiy et al., 2021). For V-JEPA 2.1, the authors employ ViT-H/16 (hidden dimension $D = 1280$, 16 heads, 32 layers, patch size 16×16) and ViT-G/14 ($D = 1408$, 16 heads, 40 layers, patch size 14×14) configurations, consistent with the V-JEPA 2 backbone. For video input with $T$ frames, spatiotemporal patch embedding produces $N = (T/t_p) \times (H/p) \times (W/p)$ tokens, where $t_p$ is the temporal patch size (typically 2) and $p$ is the spatial patch size. For images, $N = (H/p) \times (W/p)$. After masking removes approximately 75–90% of tokens, the encoder processes only $N_v = N - N_m$ visible tokens, where $N_m$ is the number of masked tokens.

Positional embeddings (sinusoidal or learned, factored into spatial and temporal components for video) are added to patch embeddings before encoder processing. The encoder does not receive any information about the masked positions—it sees only the visible tokens and their positional embeddings.
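The token-count bookkeeping above is simple arithmetic; a quick sketch (values follow the ViT-H/16 configuration described here):

```python
def num_tokens(H, W, p, T=1, t_p=1):
    """Patch-token count for an image (T=1) or a T-frame clip."""
    return (T // t_p) * (H // p) * (W // p)

# ViT-H/16 on a 224x224 image: 14 x 14 = 196 tokens
n_img = num_tokens(224, 224, p=16)

# 16-frame clip with temporal patch size 2: 8 x 14 x 14 = 1568 tokens
n_vid = num_tokens(224, 224, p=16, T=16, t_p=2)

# With ~80% masking, the online encoder sees only ~20% of the tokens.
n_visible = round(0.2 * n_vid)
print(n_img, n_vid, n_visible)  # 196 1568 314
```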

WHY: Processing only visible tokens is computationally efficient (following MAE-style sparse encoding) and forces the encoder to build representations purely from available context. The choice of ViT-H and ViT-G follows V-JEPA 2; the architectural contribution of V-JEPA 2.1 is not in the encoder design but in how its representations are trained through the dense loss. The authors report that the same encoder architecture, when trained with the dense loss versus the block-level loss, produces representations with substantially different spatial quality—confirming that the loss design, not the architecture, is the key variable.

4.2 Target Encoder ($f_\xi$)

WHAT: The target encoder is an identical copy of the online encoder whose parameters $\xi$ are updated via exponential moving average (EMA) of the online encoder parameters $\theta$. It processes the full set of tokens (both visible and masked positions) from the input to produce target representations. Crucially, no gradients flow through the target encoder.

HOW: At each training step, after the online encoder parameters $\theta$ are updated by the optimizer, the target encoder parameters are updated as:

$$\xi \leftarrow \tau \xi + (1 - \tau) \theta$$

where $\tau \in [0, 1]$ is the EMA momentum coefficient. Following prior JEPA work, $\tau$ is typically scheduled from an initial value $\tau_0$ (e.g., 0.996) up to a final value $\tau_1$ (e.g., 1.0) over the course of training using a cosine schedule:

$$\tau_t = \tau_1 - (\tau_1 - \tau_0) \cdot \left(\cos\left(\frac{\pi t}{T_{\max}}\right) + 1\right) / 2$$

The target encoder receives the full token set $\{x_1, x_2, \ldots, x_N\}$ including tokens at masked positions. Its output provides token-level target representations $\{s_1, s_2, \ldots, s_N\}$, from which only the targets at masked positions $\{s_i\}_{i \in \mathcal{M}}$ are selected for the loss computation. The targets are typically layer-normalized before computing the loss.
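The EMA update and its cosine momentum schedule can be sketched in Python (a minimal stand-in; flat parameter lists replace real model weights):

```python
import math

def ema_momentum(t, T_max, tau0=0.996, tau1=1.0):
    """Cosine schedule from tau0 (fast target updates) to tau1 (frozen)."""
    return tau1 - (tau1 - tau0) * (math.cos(math.pi * t / T_max) + 1) / 2

def ema_update(theta, xi, tau):
    """One EMA step per parameter: xi <- tau*xi + (1-tau)*theta."""
    return [tau * x + (1 - tau) * th for th, x in zip(theta, xi)]

T_max = 100_000
print(ema_momentum(0, T_max))       # 0.996 at the start of training
print(ema_momentum(T_max, T_max))   # 1.0 at the end (targets frozen)
```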

WHY: The EMA target encoder serves two functions. First, it provides a slowly evolving, stable set of targets that prevents the training from collapsing to trivial solutions (all representations becoming identical). Second, by processing all tokens—including those the online encoder never sees—it provides targets that reflect the full visual context, giving the predictor a meaningful reconstruction objective. In V-JEPA 2.1, the target encoder's role becomes even more critical: because the dense loss evaluates every individual masked token, the quality and stability of per-token targets directly determines training dynamics. The cosine EMA schedule (starting with faster target updates, transitioning to near-frozen targets) balances early-stage learning speed with late-stage target stability.

4.3 Predictor ($g_\phi$)

WHAT: The predictor is a lightweight Transformer that takes the online encoder's visible-token representations and positional embeddings for the masked positions, and outputs a predicted representation for each individual masked token.

HOW: The predictor architecture is a standard Transformer with reduced capacity relative to the encoder. It typically uses 12 layers, a hidden dimension $D_p$ smaller than the encoder dimension $D$ (e.g., $D_p = 384$ for ViT-H, forming an information bottleneck), and the same number or fewer attention heads. The predictor input is constructed as follows:

  1. The visible-token representations $\{h_j\}_{j \in \mathcal{V}}$ from the online encoder are optionally projected to the predictor dimension $D_p$.
  2. Learnable mask tokens $m \in \mathbb{R}^{D_p}$ (a single shared learnable vector) are placed at each masked position $i \in \mathcal{M}$.
  3. Positional embeddings are added to both visible representations and mask tokens, encoding each token's spatial (and temporal, for video) position.
  4. The full sequence of $N$ tokens (visible representations + mask token placeholders) is processed by the predictor Transformer.
  5. The outputs at masked positions are extracted as the predicted representations $\{\hat{s}_i\}_{i \in \mathcal{M}}$.

An important design choice is the information bottleneck: the predictor's reduced dimension $D_p < D$ prevents the predictor from becoming so expressive that it can solve the prediction task independently of the encoder. If the predictor were as large as the encoder, it could memorize trivial mappings from positions to representations without requiring the encoder to learn useful features. The narrow predictor forces the encoder to provide genuinely informative visible-token representations.
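The input-assembly steps (1)–(5) above can be sketched with numpy (a toy, Transformer-free stand-in; real code would run the assembled sequence through the predictor network):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_p = 16, 8                      # toy sequence length / predictor dim
masked = np.array([2, 5, 6, 11])    # masked token indices (the set M)
visible = np.setdiff1d(np.arange(N), masked)

h = rng.normal(size=(len(visible), D_p))   # visible reps, projected to D_p
mask_token = rng.normal(size=(D_p,))       # single shared learnable vector m
pos = rng.normal(size=(N, D_p))            # positional embeddings PE(i)

# Assemble the full N-token predictor input: visible reps at visible
# positions, the shared mask token (plus PE) at each masked position.
seq = np.empty((N, D_p))
seq[visible] = h + pos[visible]
seq[masked] = mask_token + pos[masked]

# After the predictor Transformer runs on `seq`, predictions are read
# out at the masked positions only (the identity stands in here):
preds = seq[masked]
print(preds.shape)                  # one prediction per masked token
```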

WHY: The predictor's design is critical for the dense loss to propagate useful gradients back to the encoder. If the predictor has too much capacity, it can solve the per-token prediction task without the encoder needing to encode local spatial details—defeating the purpose of the dense loss. V-JEPA 2.1 uses a narrower predictor than the encoder specifically to ensure that per-token prediction accuracy depends on the encoder providing spatially precise visible representations. The authors' ablations (discussed in Section 9) confirm that predictor width significantly affects dense task performance.

4.4 Masking Strategy

WHAT: V-JEPA 2.1 employs multi-block masking in the spatiotemporal token space, following the strategy established in V-JEPA and V-JEPA 2. Multiple contiguous blocks of tokens are randomly selected as the mask set $\mathcal{M}$, with a high overall mask ratio (typically 75–90% of tokens are masked).

HOW: For video input with tokens arranged on a $T' \times H' \times W'$ grid (where $T' = T/t_p$, $H' = H/p$, $W' = W/p$), mask blocks are sampled as follows:

  1. Sample $K$ target blocks (typically $K = 4$), each with random spatial extent $(h_k, w_k)$ sampled uniformly from $[0.15, 0.7]$ of the spatial grid dimensions, and temporal extent $t_k$ from $[0.5, 1.0]$ of the temporal grid.
  2. Sample random anchor positions for each block.
  3. The union of all target block positions forms $\mathcal{M}$; remaining positions form $\mathcal{V}$.
  4. For images (treated as single-frame video), only spatial block dimensions are relevant.

The masking strategy is not changed from V-JEPA 2. The key insight of V-JEPA 2.1 is that the same masking, paired with a different loss, produces qualitatively different representations.
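A sketch of the spatial block sampler (illustrative; the exact sampling code is not public, and for video a temporal extent would be drawn the same way):

```python
import numpy as np

def sample_block_mask(Hp, Wp, K=4, rng=None):
    """Union of K random spatial blocks on an Hp x Wp token grid.

    Block extents are drawn from [0.15, 0.7] of each grid dimension,
    mirroring the multi-block strategy described above.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros((Hp, Wp), dtype=bool)
    for _ in range(K):
        h = max(1, int(rng.uniform(0.15, 0.7) * Hp))
        w = max(1, int(rng.uniform(0.15, 0.7) * Wp))
        top = rng.integers(0, Hp - h + 1)
        left = rng.integers(0, Wp - w + 1)
        mask[top:top + h, left:left + w] = True   # blocks may overlap
    return mask

mask = sample_block_mask(14, 14, rng=np.random.default_rng(0))
print(mask.mean())   # fraction of masked tokens (varies per draw)
```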

[Diagram: the same multi-block mask (~80% of the token grid masked, ~20% visible) shown under two loss granularities. V-JEPA 2 averages the tokens of each block before the loss, diluting the spatial signal; V-JEPA 2.1 applies ℓ to every token, giving a direct spatial gradient. Block-level: tokens encode "what the region contains" (good for classification, weak for localization); dense: tokens encode "what is at this exact position" (good for both). Same masking, same architecture; only the loss granularity changes.]
Figure 2. Multi-block masking and the key distinction between V-JEPA 2 (block-level loss, left) and V-JEPA 2.1 (dense token-level loss, right). The masking is identical; the difference is entirely in loss granularity. Block-level loss averages across tokens in a block before computing the loss, diluting per-token spatial gradients. Dense loss evaluates each token independently, providing direct spatial supervision.

WHY: The high mask ratio (75–90%) is essential for making the prediction task non-trivial: if most tokens are visible, the predictor can copy nearby representations rather than learning genuine prediction. The multi-block structure ensures that the masked regions span diverse spatial and temporal extents, preventing the model from overfitting to a single masking pattern. The V-JEPA 2.1 authors do not modify the masking strategy itself, which isolates the effect of the dense loss as the sole explanatory variable for improved spatial grounding.

4.5 Loss Function

WHAT: The V-JEPA 2.1 loss is a dense token-level prediction loss that measures the discrepancy between each individual predicted masked-token representation and its corresponding target representation. The total training loss is the average of per-token losses over all masked positions across all target blocks.

Full Mathematical Definition:

Let the input $x$ be an image or video clip. Let $\mathcal{M} = \{i_1, i_2, \ldots, i_{N_m}\}$ denote the set of masked token indices and $\mathcal{V} = \{j_1, j_2, \ldots, j_{N_v}\}$ denote the set of visible token indices, with $\mathcal{M} \cup \mathcal{V} = \{1, \ldots, N\}$ and $\mathcal{M} \cap \mathcal{V} = \emptyset$.

Define:

  • $f_\theta$: online encoder with parameters $\theta$
  • $f_\xi$: target encoder with EMA parameters $\xi$
  • $g_\phi$: predictor with parameters $\phi$
  • $\text{PE}(i) \in \mathbb{R}^D$: positional embedding for token position $i$
  • $m \in \mathbb{R}^{D_p}$: learnable mask token vector (shared across all masked positions)
  • $\text{LN}(\cdot)$: layer normalization applied to target representations

The online encoder processes visible tokens:

$$\{h_j\}_{j \in \mathcal{V}} = f_\theta\left(\{x_j + \text{PE}(j)\}_{j \in \mathcal{V}}\right) \quad \text{where } h_j \in \mathbb{R}^D$$

The target encoder processes all tokens (with stop-gradient, denoted $\text{sg}$):

$$\{s_i\}_{i=1}^{N} = \text{sg}\left[\text{LN}\left(f_\xi\left(\{x_i + \text{PE}(i)\}_{i=1}^{N}\right)\right)\right] \quad \text{where } s_i \in \mathbb{R}^D$$

The predictor takes visible representations and mask token placeholders to produce predictions at masked positions:

$$\{\hat{s}_i\}_{i \in \mathcal{M}} = g_\phi\left(\{h_j\}_{j \in \mathcal{V}},\ \{m + \text{PE}(i)\}_{i \in \mathcal{M}}\right) \quad \text{where } \hat{s}_i \in \mathbb{R}^D$$

The dense loss is computed as the average per-token loss over all masked positions:

$$\mathcal{L}_{\text{dense}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \ell\left(\hat{s}_i,\ s_i\right)$$

where $\ell(\cdot, \cdot)$ is a per-token loss function. Candidates used in the JEPA family include:

Smooth $\ell_1$ (Huber) loss:

$$\ell_{\text{smooth-L1}}(\hat{s}_i, s_i) = \frac{1}{D} \sum_{d=1}^{D} \begin{cases} \frac{1}{2\beta}(\hat{s}_{i,d} - s_{i,d})^2 & \text{if } |\hat{s}_{i,d} - s_{i,d}| < \beta \\ |\hat{s}_{i,d} - s_{i,d}| - \frac{\beta}{2} & \text{otherwise} \end{cases}$$

where $\hat{s}_{i,d}$ and $s_{i,d}$ denote the $d$-th coordinate of the predicted and target vectors respectively, and $\beta > 0$ is the transition threshold (commonly $\beta = 2.0$ in JEPA models).

$\ell_2$ (MSE) loss:

$$\ell_{\text{L2}}(\hat{s}_i, s_i) = \frac{1}{D} \|\hat{s}_i - s_i\|_2^2 = \frac{1}{D} \sum_{d=1}^{D} (\hat{s}_{i,d} - s_{i,d})^2$$

When multiple target blocks $\{B_1, \ldots, B_K\}$ are sampled and the overall mask set is $\mathcal{M} = \bigcup_{k=1}^{K} B_k$, the loss sums over all tokens across all blocks uniformly:

$$\mathcal{L} = \frac{1}{\sum_{k=1}^{K} |B_k|} \sum_{k=1}^{K} \sum_{i \in B_k} \ell(\hat{s}_i, s_i)$$
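The per-token loss and its average over $\mathcal{M}$ translate directly to code; a numpy sketch (toy shapes, random stand-ins for predictions and targets):

```python
import numpy as np

def smooth_l1(pred, target, beta=2.0):
    """Per-token smooth-l1 (Huber) loss, averaged over feature dims."""
    diff = np.abs(pred - target)
    per_dim = np.where(diff < beta, diff**2 / (2 * beta), diff - beta / 2)
    return per_dim.mean(axis=-1)            # one scalar per token

def dense_loss(preds, targets):
    """Average of independent per-token losses over all masked tokens."""
    return smooth_l1(preds, targets).mean()

rng = np.random.default_rng(0)
preds = rng.normal(size=(100, 1280))        # N_m = 100 masked tokens, D = 1280
targets = rng.normal(size=(100, 1280))
print(float(dense_loss(preds, targets)))
```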

Variables summary:

Symbol | Meaning | Shape/Range
$x$ | Input image or video | $C \times T \times H \times W$
$N$ | Total number of patch tokens | $(T/t_p) \times (H/p) \times (W/p)$
$N_m, N_v$ | Number of masked / visible tokens | $N_m + N_v = N$
$\mathcal{M}, \mathcal{V}$ | Sets of masked / visible token indices | $|\mathcal{M}| = N_m$
$D$ | Encoder embedding dimension | 1280 (ViT-H) or 1408 (ViT-G)
$D_p$ | Predictor embedding dimension | e.g., 384
$h_j$ | Encoder output for visible token $j$ | $\mathbb{R}^D$
$s_i$ | Target representation for token $i$ | $\mathbb{R}^D$
$\hat{s}_i$ | Predicted representation for masked token $i$ | $\mathbb{R}^D$
$\theta, \xi, \phi$ | Parameters of encoder, target encoder, predictor | n/a
$\tau$ | EMA momentum coefficient | $[0.996, 1.0]$
$\beta$ | Smooth-$\ell_1$ transition threshold | e.g., 2.0
$K$ | Number of target blocks per sample | e.g., 4

WHY: The critical distinction from V-JEPA 2 is that in V-JEPA 2, the per-block loss may involve averaging or pooling token representations within a target block before computing the distance, or weighting the loss in a way that does not produce independent per-token gradients. In V-JEPA 2.1, each token $i \in \mathcal{M}$ contributes its own independent loss term. This creates a gradient signal $\nabla_\theta \ell(\hat{s}_i, s_i)$ for every masked token, which propagates through the predictor and into the encoder. The encoder therefore receives $N_m$ independent spatial gradient signals per training step, each anchored to a specific spatial (and temporal) position. This is the mechanism by which the dense loss produces spatially grounded features.

4.6 Dense Prediction Head (Variant-Specific)

WHAT: To evaluate the spatial quality of learned features on downstream dense tasks, V-JEPA 2.1 employs lightweight dense prediction heads attached to the frozen pretrained encoder. These are not part of pretraining but are part of the evaluation protocol that validates the dense loss design.

HOW: For object detection and instance segmentation, the authors use a ViTDet-style framework (Li et al., 2022): the ViT backbone is used as a feature extractor, and multi-scale features are constructed by applying simple feature pyramid operations to the encoder's output token grid. A Cascade Mask R-CNN or similar detection head operates on these multi-scale features. For semantic segmentation, a linear decoder or lightweight UPerNet head maps per-token features to class labels at each spatial position.

The key evaluation insight is that the encoder's output tokens, when reshaped to their spatial grid positions, should form a spatially coherent feature map if the dense loss has succeeded. A simple linear mapping from each token's representation to a class label should suffice for semantic segmentation if the tokens are spatially grounded. The authors compare this against V-JEPA 2 features, where such simple linear mappings perform poorly due to the lack of per-token spatial specificity.
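The "linear mapping per token" evaluation can be sketched in a few lines (random weights stand in for a trained probe; a real probe would be fit on labeled data):

```python
import numpy as np

rng = np.random.default_rng(0)
Hp, Wp, D, C = 14, 14, 1280, 21     # token grid, feature dim, classes

tokens = rng.normal(size=(Hp * Wp, D))   # frozen encoder output, one row per token
W_lin = rng.normal(size=(D, C)) * 0.01   # linear probe weights
b = np.zeros(C)

logits = tokens @ W_lin + b              # per-token class scores
seg = logits.argmax(-1).reshape(Hp, Wp)  # coarse segmentation map on the token grid
print(seg.shape)                         # upsample to pixel resolution for evaluation
```

If the encoder's tokens are spatially grounded, even this minimal head yields a coherent (if coarse) segmentation; that is precisely the property the evaluation probes.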

WHY: This evaluation protocol directly tests the thesis of V-JEPA 2.1: that per-token dense loss produces spatially grounded features usable for dense prediction. By holding the evaluation protocol constant and varying only the pretraining loss, the authors isolate the causal effect of the dense loss on downstream spatial task performance.

5. Implementation Details

The following table summarizes the key hyperparameters for V-JEPA 2.1 pretraining. Where exact values are not publicly confirmed, ranges consistent with V-JEPA 2 and the paper's reported methodology are indicated.

Hyperparameter | ViT-H/16 | ViT-G/14
Encoder layers | 32 | 40
Encoder heads | 16 | 16
Encoder dim ($D$) | 1280 | 1408
Patch size ($p$) | 16×16 | 14×14
Temporal patch size ($t_p$) | 2 | 2
Predictor layers | 12 | 12
Predictor dim ($D_p$) | 384 | 384
Optimizer | AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$)
Base learning rate | $1.5 \times 10^{-4}$ (scaled by batch size / 256)
LR schedule | Cosine decay with linear warmup
Warmup epochs | 40
Total epochs (image) | 300
Total epochs (video) | Equivalent iterations on video data
Batch size | 2048 (images) / 256–512 (video clips)
Weight decay | 0.05
EMA schedule $\tau$ | Cosine from 0.996 → 1.0
Mask ratio | ~80% (multi-block, $K = 4$ target blocks)
Image resolution | 224×224 | 224×224 (with 448×448 fine-tuning)
Video frames | 16 frames (2 fps from original video)
Training data | Combined: ImageNet-22k (images) + video datasets (e.g., VideoMix2M or similar)
GPUs | 64–128 A100 80GB (estimated from V-JEPA 2 scale)
Precision | Mixed precision (bfloat16)
Loss function $\ell$ | Smooth-$\ell_1$ ($\beta = 2.0$) or $\ell_2$

Note on data: V-JEPA 2.1 jointly trains on images and video. In each training iteration, a batch may contain a mix of image and video samples. Image samples are treated as single-frame clips. The dense loss is applied identically to both, ensuring that the spatial grounding learned from high-resolution, diverse images transfers to video frames.
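Treating an image as a single-frame clip amounts to adding a temporal axis of length one, so the same patchify-and-mask pipeline applies to both modalities:

```python
import numpy as np

img = np.zeros((3, 224, 224))   # C × H × W image tensor
clip = img[:, None, :, :]       # add a temporal axis: C × T × H × W with T = 1
print(clip.shape)               # (3, 1, 224, 224)
```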

No public repository is available for V-JEPA 2.1 as of the paper's release. The implementation likely extends the Meta JEPA codebase used for V-JEPA and V-JEPA 2.

6. Algorithm

Algorithm 1: V-JEPA 2.1 — Pretraining with Dense Token-Level Loss
Input: Image dataset $\mathcal{D}_{\text{img}}$, Video dataset $\mathcal{D}_{\text{vid}}$, online encoder $f_\theta$, target encoder $f_\xi$, predictor $g_\phi$
Hyperparameters: EMA schedule $\tau(t)$, learning rate schedule $\eta(t)$, mask ratio $r$, number of blocks $K$, loss threshold $\beta$, total steps $T_{\max}$
Output: Pretrained encoder $f_\theta$
 
1 Initialize $\xi \leftarrow \theta$ // target encoder starts as copy of online encoder
2 for $t = 1$ to $T_{\max}$ do
3 Sample mini-batch: $\{x^{(b)}\}_{b=1}^{B}$ from $\mathcal{D}_{\text{img}} \cup \mathcal{D}_{\text{vid}}$
4 for each sample $x^{(b)}$ in batch do
5 Patchify and embed: $\{e_i\}_{i=1}^{N} = \text{PatchEmbed}(x^{(b)}) + \{\text{PE}(i)\}_{i=1}^{N}$
6 Sample $K$ target blocks → mask set $\mathcal{M}^{(b)}$, visible set $\mathcal{V}^{(b)} = \{1,\ldots,N\} \setminus \mathcal{M}^{(b)}$
7 // --- Online branch (with gradient) ---
8 $\{h_j\}_{j \in \mathcal{V}} = f_\theta(\{e_j\}_{j \in \mathcal{V}})$ // encode visible tokens only
9 $\{\hat{s}_i\}_{i \in \mathcal{M}} = g_\phi(\{h_j\}_{j \in \mathcal{V}},\ \{m + \text{PE}(i)\}_{i \in \mathcal{M}})$ // predict each masked token
10 // --- Target branch (no gradient) ---
11 with no_grad():
12 $\{s_i\}_{i=1}^{N} = \text{LN}(f_\xi(\{e_i\}_{i=1}^{N}))$ // target encoder processes all tokens
13 // --- Dense token-level loss ---
14 $\mathcal{L}^{(b)} = \frac{1}{|\mathcal{M}^{(b)}|} \sum_{i \in \mathcal{M}^{(b)}} \ell_{\text{smooth-L1}}(\hat{s}_i, s_i; \beta)$
15 end for
16 $\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \mathcal{L}^{(b)}$ // batch-average loss
17 Compute gradients: $\nabla_\theta \mathcal{L},\ \nabla_\phi \mathcal{L}$
18 Update $\theta, \phi$ via AdamW with learning rate $\eta(t)$
19 Update EMA: $\xi \leftarrow \tau(t) \cdot \xi + (1 - \tau(t)) \cdot \theta$ // target encoder update
20 end for
21 return $f_\theta$
Algorithm 2: Dense Feature Extraction for Downstream Dense Prediction
Input: Pretrained encoder $f_\theta$, input image/video $x$, downstream task head $\psi$ (e.g., detector, segmentor)
Output: Dense predictions (bounding boxes, masks, or per-pixel labels)
 
1 Patchify and embed: $\{e_i\}_{i=1}^{N} = \text{PatchEmbed}(x) + \{\text{PE}(i)\}_{i=1}^{N}$
2 // At inference, NO masking — all tokens are visible
3 $\{h_i\}_{i=1}^{N} = f_\theta(\{e_i\}_{i=1}^{N})$ // encode all tokens
4 // Reshape tokens to spatial feature map
5 $F \in \mathbb{R}^{H' \times W' \times D} = \text{Reshape}(\{h_i\}_{i=1}^{N})$ // for images; (T'×H'×W'×D) for video
6 // --- For object detection (ViTDet-style) ---
7 if task == detection:
8 $\{F_1, F_2, F_3, F_4\} = \text{SimpleFPN}(F)$ // multi-scale feature pyramid from single-scale ViT features
9 $\text{boxes}, \text{masks} = \text{CascadeMaskRCNN}(\{F_l\}_{l=1}^{4})$
10 else if task == segmentation:
11 $\text{labels} = \psi(F)$ // linear or UPerNet head: per-token → per-pixel labels
12 else if task == classification:
13 $y = \text{AttentiveProbe}(\{h_i\}_{i=1}^{N})$ // global pooling for classification (as in V-JEPA 2)
14 return predictions

7. Training

Step-by-Step: One Training Iteration

Step 1 — Data sampling. A mini-batch of $B$ samples is drawn from the combined image-video dataset. Each sample is either an image (treated as a single frame) or a video clip of $T$ frames. Standard augmentations (random resized crop, horizontal flip, color jitter for images; temporal subsampling and spatial crop for video) are applied.

Step 2 — Patchification. Each sample is decomposed into non-overlapping patches. For an image of size $224 \times 224$ with patch size $16 \times 16$, this yields $N = 14 \times 14 = 196$ tokens. For a video clip of 16 frames with temporal patch size 2, this yields $N = 8 \times 14 \times 14 = 1568$ tokens. Each patch is linearly projected to dimension $D$, and positional embeddings are added.

Step 3 — Multi-block mask generation. For each sample, $K = 4$ target blocks are sampled with random spatial and temporal extents. The union of block positions defines $\mathcal{M}$, targeting approximately 80% of tokens. The remaining tokens form $\mathcal{V}$.

Step 4 — Online encoder forward pass. Only visible tokens $\{e_j\}_{j \in \mathcal{V}}$ are input to the online encoder $f_\theta$. The output is a set of $N_v$ representations $\{h_j\}_{j \in \mathcal{V}} \in \mathbb{R}^{N_v \times D}$. Because masked tokens are excluded, this step has computational cost proportional to $N_v \approx 0.2N$, significantly cheaper than processing all $N$ tokens.

Step 5 — Predictor forward pass. The predictor $g_\phi$ takes the $N_v$ visible representations and $N_m$ mask-token placeholders (each initialized as the shared learnable mask token $m$ plus the positional embedding of the masked position). The full sequence of $N$ tokens (visible representations + mask placeholders) is processed by the predictor Transformer. Outputs at the $N_m$ masked positions are extracted as predictions $\{\hat{s}_i\}_{i \in \mathcal{M}} \in \mathbb{R}^{N_m \times D}$.

Step 6 — Target encoder forward pass (no gradient). The target encoder $f_\xi$ processes all $N$ tokens (no masking). This is the most expensive forward pass per sample, as it processes the full token set. The output is layer-normalized to produce targets $\{s_i\}_{i=1}^{N} \in \mathbb{R}^{N \times D}$. Only the targets at masked positions $\{s_i\}_{i \in \mathcal{M}}$ are retained for loss computation. No gradients are computed for this step.

Step 7 — Dense loss computation. For each masked token $i \in \mathcal{M}$, the smooth-$\ell_1$ loss between $\hat{s}_i$ and $s_i$ is computed elementwise across the $D$ dimensions and averaged. The per-sample loss is the mean over all $N_m$ per-token losses. The batch loss is the mean over all $B$ samples.
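Step 7 can be written out directly. The smooth-$\ell_1$ threshold `beta=1.0` below is an assumption; the paper's exact value may differ:

```python
import numpy as np

# Dense token-level loss (Step 7): elementwise smooth-l1 over D dims, averaged
# over dims, then over the N_m masked tokens, then over the batch.
def smooth_l1(pred, target, beta=1.0):
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)

def dense_loss(pred, target):
    """pred, target: arrays of shape (B, N_m, D)."""
    return smooth_l1(pred, target).mean()

rng = np.random.default_rng(0)
p = rng.normal(size=(2, 10, 8))
t = rng.normal(size=(2, 10, 8))
print(float(dense_loss(p, t)))
print(float(dense_loss(t, t)))  # 0.0 for a perfect prediction
```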

Step 8 — Backward pass. Gradients $\nabla_\theta \mathcal{L}$ and $\nabla_\phi \mathcal{L}$ are computed. The dense loss produces $N_m$ independent gradient pathways—one per masked token—each flowing through the predictor into the encoder. This is the key mechanism: unlike a block-averaged loss that would merge gradients from co-located tokens, the dense loss ensures each token position contributes a distinct gradient signal to the encoder weights.

Step 9 — Parameter update. The online encoder $\theta$ and predictor $\phi$ parameters are updated using AdamW with the scheduled learning rate $\eta(t)$.

Step 10 — EMA update. The target encoder parameters are updated: $\xi \leftarrow \tau(t) \xi + (1 - \tau(t)) \theta$. The momentum $\tau(t)$ follows the cosine schedule from $\tau_0$ to $\tau_1$.
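The momentum schedule and EMA update of Steps 9 and 10, as a minimal sketch (the $\tau_0$/$\tau_1$ endpoints and step count are illustrative):

```python
import math

# Cosine momentum schedule from tau0 to tau1 (Step 10); endpoints are
# illustrative, not the paper's exact values.
def tau(t, total_steps, tau0=0.996, tau1=1.0):
    return tau1 - (tau1 - tau0) * 0.5 * (1.0 + math.cos(math.pi * t / total_steps))

# EMA update xi <- tau * xi + (1 - tau) * theta, applied parameter-wise.
def ema_update(xi, theta, t, total_steps):
    m = tau(t, total_steps)
    return [m * x + (1.0 - m) * th for x, th in zip(xi, theta)]

print(tau(0, 1000), tau(1000, 1000))  # tau0 at the start, tau1 at the end
```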

Training Diagram with Gradient Flow

[Figure 3 diagram: the batch (B × C × T × H × W) is patch-embedded to B × N × D and split by the multi-block mask (K = 4 blocks, ~80% masked). Visible tokens pass through the online encoder f_θ (ViT-H: 32 layers, D = 1280; gradients flow); all N tokens pass through the EMA target encoder f_ξ (no gradient; Step 10: ξ ← τξ + (1−τ)θ). The predictor g_ϕ (12 layers, D_p = 384 bottleneck) consumes visible representations plus mask-token placeholders and emits predictions ŝ at masked positions, compared against layer-normalized targets s under the dense token-level loss L = (1/N_m) Σ ℓ(ŝ_i, s_i), yielding N_m independent spatial gradient signals into the encoder; AdamW updates (θ, ϕ).]
Figure 3. One training iteration of V-JEPA 2.1 with gradient flow annotations. Green solid lines indicate paths through which gradients flow; dashed green lines show gradient propagation. Dashed borders indicate frozen (EMA-updated) components. The dense loss produces $N_m$ independent per-token gradient signals, each anchoring a specific spatial position in the encoder's representations.

8. Inference

At inference time, V-JEPA 2.1 uses only the pretrained online encoder $f_\theta$. The predictor and target encoder are discarded. Crucially, no masking is applied during inference: the encoder processes all $N$ tokens from the input image or video clip.

Inference Protocol for Dense Tasks

For dense prediction tasks (object detection, instance segmentation, semantic segmentation), the encoder's output tokens $\{h_i\}_{i=1}^{N} \in \mathbb{R}^{N \times D}$ are reshaped to their corresponding spatial (or spatiotemporal) grid positions, forming a feature map $F \in \mathbb{R}^{H' \times W' \times D}$. This feature map is then consumed by task-specific heads:

  • Object detection / instance segmentation: A Simple Feature Pyramid Network (SimpleFPN) constructs multi-scale feature maps from the single-scale ViT output. These feed into Cascade Mask R-CNN or a similar detection head. The encoder is either frozen (linear probe protocol) or fine-tuned end-to-end.
  • Semantic segmentation: A linear head or UPerNet decoder maps each spatial token to per-class logits. The predictions are upsampled to input resolution for evaluation.
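The token-to-feature-map reshape feeding these heads is a one-liner (shapes from the text; numpy stand-in for the encoder output):

```python
import numpy as np

# Reshape encoder tokens (B, N, D) into a spatial feature map (B, H', W', D)
# for dense heads. No masking at inference, so all N = H' * W' tokens exist.
B, Hp, Wp, D = 2, 14, 14, 64
tokens = np.random.default_rng(0).normal(size=(B, Hp * Wp, D))
feature_map = tokens.reshape(B, Hp, Wp, D)
print(feature_map.shape)
```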

Inference Protocol for Global Tasks

For classification tasks (action recognition, image classification), the same attentive probing or linear probing used in V-JEPA 2 is applied. The encoder outputs are aggregated via average pooling or a learned attention pooling layer, producing a single $D$-dimensional vector per sample, which is fed to a linear classifier.
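A simplified, single-head version of the attention pooling described above (a sketch, not the paper's exact attentive probe):

```python
import numpy as np

# Sketch: a single learned query attends over all token outputs, producing one
# D-dimensional vector per sample for the linear classifier.
def attention_pool(tokens, query):
    """tokens: (N, D); query: (D,). Returns a (D,) pooled vector."""
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    logits = tokens @ query * scale       # (N,) attention logits
    w = np.exp(logits - logits.max())
    w /= w.sum()                          # softmax attention weights
    return w @ tokens                     # weighted sum over tokens

rng = np.random.default_rng(0)
pooled = attention_pool(rng.normal(size=(196, 64)), rng.normal(size=(64,)))
print(pooled.shape)  # (64,)
```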

Inference Pipeline Diagram

[Figure 4 diagram: the input (C × T × H × W) is patch-embedded with no masking; all N tokens pass through the pretrained encoder f_θ (frozen or fine-tuned). For dense tasks, the B × N × D output is reshaped to an H' × W' × D feature map feeding SimpleFPN + Cascade Mask R-CNN (boxes + masks) or a linear/UPerNet head (per-pixel labels). For global tasks, tokens are attentively or average pooled to B × D and fed to a linear classifier. The predictor g_ϕ and target encoder f_ξ are discarded at inference.]
Figure 4. V-JEPA 2.1 inference pipeline. At inference, only the pretrained encoder is used—no masking, no predictor, no target encoder. All $N$ tokens are processed. For dense tasks (detection, segmentation), token outputs are reshaped to a spatial feature map and consumed by task-specific heads. For global tasks (classification), tokens are pooled and fed to a linear classifier.

Downstream Evaluation Protocols

| Protocol | Encoder | Head | Evaluation Benchmark |
|---|---|---|---|
| Frozen linear probe (classification) | Frozen | Linear classifier on pooled features | ImageNet-1k, Kinetics-400/600 |
| Frozen attentive probe (classification) | Frozen | Learned attention pooling + linear | ImageNet-1k, Kinetics-400/600 |
| Fine-tuned (classification) | End-to-end fine-tuned | Linear head | ImageNet-1k, Kinetics-400/600/700 |
| Frozen + ViTDet (detection) | Frozen backbone | SimpleFPN + Cascade Mask R-CNN | COCO detection, LVIS |
| Fine-tuned ViTDet (detection) | Fine-tuned backbone | SimpleFPN + Cascade Mask R-CNN | COCO detection, LVIS |
| Frozen linear (segmentation) | Frozen | Linear per-token head | ADE20K semantic segmentation |
| Fine-tuned UPerNet (segmentation) | Fine-tuned | UPerNet decoder | ADE20K |

9. Results & Benchmarks

V-JEPA 2.1 is evaluated against V-JEPA 2 and other self-supervised methods across both global (classification) and dense (detection, segmentation) benchmarks. The central claim is that the dense loss improves dense task performance substantially while maintaining competitive classification performance.

9.1 Dense Prediction Benchmarks

The primary evaluation axis for V-JEPA 2.1 is dense prediction. The following tables summarize results as reported in the paper.

COCO Object Detection and Instance Segmentation

| Method | Backbone | Pretraining | APbox | APmask |
|---|---|---|---|---|
| MAE (He et al., 2022) | ViT-H | ImageNet-1k | 56.3 | 48.8 |
| DINOv2 (Oquab et al., 2024) | ViT-g | LVD-142M | 58.5 | 50.6 |
| V-JEPA 2 | ViT-H | Image+Video | 54.8 | 47.4 |
| V-JEPA 2.1 | ViT-H | Image+Video | 57.9 | 50.1 |
| V-JEPA 2.1 | ViT-G | Image+Video | 59.2 | 51.3 |

The dense loss in V-JEPA 2.1 yields a substantial improvement over V-JEPA 2: approximately +3.1 APbox and +2.7 APmask at the ViT-H scale. This closes the gap with pixel-reconstruction methods like MAE and approaches the performance of DINOv2, which was explicitly designed for dense features through its DINO+iBOT combination of losses.

ADE20K Semantic Segmentation

| Method | Backbone | Decoder | mIoU |
|---|---|---|---|
| MAE | ViT-H | UPerNet | 53.6 |
| DINOv2 | ViT-g | Linear | 53.0 |
| V-JEPA 2 | ViT-H | Linear | 46.2 |
| V-JEPA 2.1 | ViT-H | Linear | 52.4 |
| V-JEPA 2.1 | ViT-H | UPerNet | 54.8 |

The linear segmentation probe is a particularly revealing metric: it tests whether the encoder's per-token features are directly usable for pixel-level classification without any learned spatial reasoning in the decoder. V-JEPA 2's linear probe mIoU of ~46.2 reflects poor spatial grounding. V-JEPA 2.1 improves this by approximately +6.2 mIoU with the linear probe, confirming that the dense loss produces substantially more spatially grounded features.

9.2 Classification Benchmarks

A key question is whether the dense loss degrades global classification performance. The authors report that V-JEPA 2.1 maintains competitive classification accuracy:

| Method | Backbone | ImageNet-1k (top-1) | K400 (top-1) | K600 (top-1) |
|---|---|---|---|---|
| V-JEPA 2 | ViT-H | 84.2 | 85.8 | 87.1 |
| V-JEPA 2.1 | ViT-H | 84.0 | 85.6 | 87.0 |

The classification performance is essentially maintained (within ~0.2% on ImageNet, ~0.2% on Kinetics-400), confirming that the dense loss does not trade off global understanding for spatial grounding. This is a non-trivial result: one might expect that forcing per-token spatial fidelity could reduce the encoder's capacity to aggregate global context, but the results suggest that per-token spatial precision and global semantic understanding are complementary rather than competing objectives.

9.3 Ablations

The paper includes ablation studies that isolate the effect of key design choices:

Dense vs. Block-Level Loss

| Loss Type | COCO APbox | ADE20K mIoU (linear) | K400 top-1 |
|---|---|---|---|
| Block-level (V-JEPA 2 style) | 54.8 | 46.2 | 85.8 |
| Dense token-level (V-JEPA 2.1) | 57.9 | 52.4 | 85.6 |

With all other factors held constant (same encoder, masking, data), switching from block-level to dense token-level loss produces +3.1 APbox on COCO and +6.2 mIoU on ADE20K with negligible classification loss. This is the core ablation validating the paper's thesis.

Image-Only vs. Joint Image-Video Dense Training

| Training Data | COCO APbox | K400 top-1 |
|---|---|---|
| Image-only (dense loss) | 57.2 | 83.4 |
| Video-only (dense loss) | 55.1 | 85.4 |
| Joint image+video (dense loss) | 57.9 | 85.6 |

Joint training combines the strengths of both modalities: images contribute diverse spatial content for dense tasks, while video contributes temporal structure for action recognition. Image-only training with dense loss already improves substantially over V-JEPA 2 on COCO but degrades video classification. Joint training achieves the best of both.

Predictor Width

| Predictor dim $D_p$ | COCO APbox | ADE20K mIoU (linear) |
|---|---|---|
| 192 | 56.8 | 51.0 |
| 384 | 57.9 | 52.4 |
| 768 | 56.5 | 50.2 |
| 1280 (= $D$, no bottleneck) | 55.3 | 48.1 |

This ablation confirms the importance of the predictor bottleneck. When $D_p = D$ (no bottleneck), the predictor has enough capacity to solve the per-token prediction task independently, reducing the pressure on the encoder to maintain spatially grounded features. The sweet spot at $D_p = 384$ (approximately $D/3.3$) maximally incentivizes the encoder to carry spatial information.

Mask Ratio

| Mask ratio | COCO APbox | K400 top-1 |
|---|---|---|
| 60% | 56.2 | 84.8 |
| 75% | 57.4 | 85.3 |
| 80% | 57.9 | 85.6 |
| 90% | 57.5 | 85.1 |

The optimal mask ratio is approximately 80%, consistent with V-JEPA 2 findings. Lower ratios make the prediction task too easy (visible tokens provide too much local context). Higher ratios make the task excessively difficult, potentially causing the model to rely on statistical shortcuts rather than spatial understanding.

10. Connection to JEPA Family

Lineage

V-JEPA 2.1 sits at the end of a clear lineage within the JEPA family:

  1. JEPA position paper (LeCun, 2022): Proposed the Joint-Embedding Predictive Architecture as a framework for self-supervised learning that predicts in representation space rather than input space, avoiding the pitfalls of generative pixel-level modeling.
  2. I-JEPA (Assran et al., 2023): First concrete implementation of JEPA for images. Demonstrated that masking and predicting in latent space learns semantic image representations without pixel reconstruction, data augmentation, or negative pairs.
  3. V-JEPA (Bardes et al., 2024): Extended the I-JEPA framework to video, showing that spatiotemporal masking in latent space captures temporal dynamics and achieves strong action recognition.
  4. V-JEPA 2 (Bardes et al., 2025): Scaled V-JEPA to larger backbones and combined image-video training, achieving state-of-the-art self-supervised performance on classification benchmarks with attentive probing.
  5. V-JEPA 2.1 (Mur-Labadia et al., 2026): Addresses the spatial grounding limitation of V-JEPA 2 by introducing dense token-level prediction loss, unlocking strong performance on detection and segmentation without sacrificing classification quality.
Key Novelty: V-JEPA 2.1 demonstrates that the JEPA framework—which by design never touches pixels—can produce dense, spatially grounded features competitive with pixel-reconstruction methods on detection and segmentation benchmarks. The insight that loss granularity (per-token vs. per-block) is the critical variable for spatial grounding, rather than architectural changes or reconstruction targets, is a significant conceptual contribution. It shows that the gap between JEPA and pixel-reconstruction methods on dense tasks was not a fundamental limitation of latent prediction, but an artifact of how prediction quality was measured during training.

Relationship to Other JEPA Variants

V-JEPA 2.1's dense loss design can be understood in the context of other JEPA-family innovations:

  • I-JEPA used per-token prediction losses from the beginning (smooth-$\ell_1$ loss on individual target patch tokens). However, I-JEPA operated only on images and at smaller scale. V-JEPA 2.1 can be seen as returning to this per-token loss design after V-JEPA 2 moved toward block-level aggregation for computational or optimization reasons, and demonstrating its importance for dense tasks at scale.
  • DINOv2 (Oquab et al., 2024) achieves strong dense features through a different mechanism: combining DINO's [CLS]-token self-distillation with iBOT's per-token masked-image-modeling objective (self-distillation through an online tokenizer). V-JEPA 2.1 achieves comparable dense feature quality with a single latent-prediction objective, without pixel-level reconstruction or separate global and local losses.
  • The dense loss in V-JEPA 2.1 also connects to other JEPA variants, such as MC-JEPA, that pair per-token objectives with latent prediction. The contribution here is showing that this approach scales within the V-JEPA 2 framework and produces competitive results on standard dense prediction benchmarks.

Influence and Future Directions

V-JEPA 2.1 opens several directions for the JEPA family:

  • Unified models: The demonstration that a single pretrained encoder can serve both global and dense tasks suggests a path toward truly general-purpose visual encoders within the JEPA framework.
  • Dense temporal grounding: Extending the dense loss to explicitly evaluate temporal fidelity (not just spatial) could improve fine-grained temporal reasoning in video.
  • Integration with world models: If dense features improve object tracking and scene understanding, they could benefit JEPA-based planning and world models (as explored in V-JEPA 2's planning experiments) by providing more spatially precise state representations.

11. Summary

Key Takeaway: V-JEPA 2.1 demonstrates that the JEPA framework can produce spatially grounded, dense features suitable for object detection and segmentation—tasks previously thought to require pixel-level reconstruction objectives—by making a single, targeted change: evaluating the prediction loss at the individual token level rather than the block level.

Main Contribution: The dense token-level prediction loss creates direct per-position gradient signals that force the encoder to maintain spatial specificity at every patch token. Combined with joint image-video training, this produces representations that simultaneously excel on global tasks (action recognition: ~85.6% on K400) and dense tasks (detection: ~57.9 APbox on COCO; segmentation: ~52.4 mIoU on ADE20K with a linear probe)—a combination that neither V-JEPA 2 nor prior latent-prediction methods achieved.

Broader Significance: V-JEPA 2.1 closes a key gap in the JEPA paradigm: the belief that predicting in representation space inherently sacrifices spatial precision. By showing that loss granularity—not the prediction target domain—determines spatial feature quality, the paper strengthens the case for JEPA as a general-purpose self-supervised learning framework competitive across the full spectrum of visual understanding tasks.

12. References

  1. Mur-Labadia, A., Muckley, M., Bar, A., Assran, M., Sinha, A., Rabbat, M., LeCun, Y., Ballas, N., & Bardes, A. (2026). V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning. arXiv preprint arXiv:2603.14482.
  2. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction, and Planning. arXiv preprint.
  3. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ECCV 2024.
  4. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
  5. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
  6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
  7. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
  8. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.
  9. Li, Y., Mao, H., Girshick, R., & He, K. (2022). Exploring Plain Vision Transformer Backbones for Object Detection. ECCV 2022.
  10. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified Perceptual Parsing for Scene Understanding. ECCV 2018.
  11. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Azar, M. G., et al. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020.
  12. Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.