1. Introduction
Self-supervised learning from video has advanced rapidly through the Joint-Embedding Predictive Architecture (JEPA) family. V-JEPA (Bardes et al., 2024) demonstrated that predicting masked spatiotemporal regions in latent space—rather than reconstructing raw pixels—yields powerful video representations for action recognition and temporal reasoning. V-JEPA 2 (Bardes et al., 2025) scaled this paradigm to combined image-and-video pretraining with larger Vision Transformer backbones, achieving strong performance on global classification benchmarks. However, a persistent limitation remains: while V-JEPA 2 excels at holistic scene understanding (e.g., classifying an action or identifying a scene category), its learned representations lack the spatial grounding necessary for dense prediction tasks such as object detection, instance segmentation, and semantic segmentation. The features, aggregated through global pooling or attentive probes, lose the fine-grained per-patch spatial information that dense tasks demand.
V-JEPA 2.1 (Mur-Labadia, Muckley, Bar, Assran, Sinha, Rabbat, LeCun, Ballas, & Bardes, 2026) directly addresses this gap. The core insight is that the standard JEPA prediction loss—which evaluates reconstruction quality at the level of masked block regions pooled or averaged over target tokens—does not explicitly incentivize the encoder to maintain spatially precise representations at each individual patch token. V-JEPA 2.1 introduces a dense predictive loss that operates at the per-token level: the predictor must reconstruct every individual target token with high fidelity, not merely match an aggregate representation of the masked region. This seemingly simple modification has profound consequences for the spatial quality of learned features.
The key contributions of V-JEPA 2.1, as described in the paper, are:
- Dense prediction loss design. A token-level prediction objective that complements the standard JEPA block-level loss, forcing the encoder to retain patch-level spatial information throughout the network.
- Combined image and video training with dense objectives. The dense loss is applied uniformly across both image and video inputs during pretraining, enabling spatial grounding to transfer from the high-resolution spatial detail in images to the spatiotemporal structure of video.
- Spatial grounding from dense features. The resulting encoder produces features that are directly useful for object detection (e.g., with ViTDet-style heads) and segmentation (e.g., with linear or lightweight decoders), without requiring task-specific architectural modifications.
- Pixel-level understanding from JEPA. V-JEPA 2.1 demonstrates that the JEPA framework—which by design never reconstructs pixels—can nonetheless produce representations competitive with pixel-reconstruction methods (e.g., MAE, VideoMAE) on dense prediction benchmarks, while retaining the superior semantic quality of latent prediction for classification tasks.
Compared to V-JEPA 2, the architectural changes are minimal: the same encoder, target encoder, and predictor backbone are reused. The critical difference lies in the loss computation. V-JEPA 2 evaluates prediction quality over masked blocks, which may allow spatial information to "wash out" across tokens within a block. V-JEPA 2.1 evaluates prediction quality per token, creating a direct gradient signal that rewards spatially precise representations at every position in the feature map. The result is a model that simultaneously achieves strong performance on global tasks (action recognition, image classification) and dense tasks (detection, segmentation)—a combination that prior JEPA variants could not achieve without sacrificing one axis of performance.
2. Method
Intuitive Overview
Consider a jigsaw puzzle analogy. In V-JEPA 2, you are given a partially completed puzzle and asked: "Does this missing region show a dog or a cat?" You can answer correctly by recognizing coarse patterns—fur color, ear shape, background context. But you do not need to know the exact placement of each individual piece within the region. In V-JEPA 2.1, the question changes: "For each missing piece, tell me exactly what appears on it—the precise texture, edge, and content." Now you must internalize fine-grained spatial details, not just the gist.
The method retains the core JEPA structure: an online encoder processes visible (unmasked) tokens from an image or video, a predictor takes these visible representations and positional information about the masked locations to predict representations of the masked tokens, and a target encoder (updated via exponential moving average of the online encoder) provides the prediction targets. The loss measures the discrepancy between predictions and targets.
The fundamental methodological change in V-JEPA 2.1 is where and how this discrepancy is measured:
- V-JEPA 2 (block-level loss): The predictor outputs representations for the masked region. The loss is computed after some form of aggregation (e.g., averaging tokens within a target block), or the per-token losses within a block are weighted in a way that emphasizes block-level coherence. The gradient signal per individual token is diluted.
- V-JEPA 2.1 (dense token-level loss): The predictor outputs a representation for each individual masked token. The loss is computed independently at every masked token position. Each token receives a direct, unattenuated gradient signal demanding that its predicted representation match the target encoder's representation at that exact spatial (and temporal) location.
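The contrast is easiest to see in a toy example. The NumPy sketch below is illustrative only: the real losses operate on learned latent representations, and the block-averaged form here is a simplified characterization of the block-level objective, not the papers' actual code.

```python
import numpy as np

def block_level_loss(pred, target):
    # Block-level style (simplified): average token representations
    # within the block before comparing, so per-token placement is lost.
    return np.mean((pred.mean(axis=0) - target.mean(axis=0)) ** 2)

def dense_token_loss(pred, target):
    # Dense token-level style: every token position contributes its
    # own independent error term.
    return np.mean((pred - target) ** 2)

# Two tokens whose block average matches the target exactly, but whose
# per-token content is spatially permuted within the block.
target = np.array([[1.0, 0.0], [0.0, 1.0]])   # 2 tokens, dim 2
pred   = np.array([[0.0, 1.0], [1.0, 0.0]])   # same tokens, swapped

print(block_level_loss(pred, target))  # 0.0 — the permutation is invisible
print(dense_token_loss(pred, target))  # 1.0 — the permutation is penalized
```

A block-averaged objective cannot distinguish a spatially scrambled prediction from a correct one; the dense objective can, which is precisely the gradient signal that rewards per-token spatial precision.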
A second important aspect of V-JEPA 2.1 is the joint image-video training with the dense objective. Images provide high spatial resolution and diversity of objects and scenes. Videos provide temporal structure and motion. By applying the dense prediction loss to both modalities, the model learns spatially grounded features that generalize across static and dynamic visual content. The image training particularly benefits dense tasks because images often contain more diverse object configurations per sample than individual video frames.
The method does not introduce new masking strategies beyond those used in V-JEPA 2. The masking operates in the spatiotemporal token space: for video, blocks of tokens spanning spatial and temporal dimensions are masked; for images, spatial blocks are masked. The distinction is purely in how the prediction quality is evaluated—not in what is masked or how masking is applied.
3. Model Overview
At-a-Glance
| Property | V-JEPA 2.1 |
|---|---|
| Input | Images (224×224 or 448×448) and video clips (e.g., 16 frames at 224×224) |
| Masking | Multi-block spatiotemporal masking (same strategy as V-JEPA 2); ~75–90% mask ratio |
| Encoder | Vision Transformer (ViT-H/16 or ViT-G/14); processes only visible (unmasked) tokens |
| Target Encoder | Same architecture as encoder; weights updated via EMA (no gradient) |
| Predictor | Narrow Transformer (e.g., 12 layers, reduced embedding dim); takes visible tokens + mask token placeholders → predicts each masked token |
| Loss | Dense token-level smooth-$\ell_1$ or $\ell_2$ loss between predicted and target representations at every masked position |
| Key Result | Competitive with supervised and pixel-reconstruction methods on COCO detection/segmentation while retaining strong Kinetics-400/600 action recognition accuracy |
| Parameters | ~630M (ViT-H encoder) or ~1.1B+ (ViT-G encoder); predictor adds ~50–100M |
Training Architecture Diagram
4. Main Components of V-JEPA 2.1
4.1 Encoder ($f_\theta$)
WHAT: The encoder is a standard Vision Transformer (ViT) that processes only the visible (unmasked) patch tokens from the input image or video. It maps visible tokens to a sequence of $D$-dimensional representations that capture both local content and global context.
HOW: The encoder follows the ViT architecture (Dosovitskiy et al., 2021). For V-JEPA 2.1, the authors employ ViT-H/16 (hidden dimension $D = 1280$, 16 heads, 32 layers, patch size 16×16) and ViT-G/14 ($D = 1408$, 16 heads, 40 layers, patch size 14×14) configurations, consistent with the V-JEPA 2 backbone. For video input with $T$ frames, spatiotemporal patch embedding produces $N = (T/t_p) \times (H/p) \times (W/p)$ tokens, where $t_p$ is the temporal patch size (typically 2) and $p$ is the spatial patch size. For images, $N = (H/p) \times (W/p)$. After masking removes approximately 75–90% of tokens, the encoder processes only $N_v = N - N_m$ visible tokens, where $N_m$ is the number of masked tokens.
Positional embeddings (sinusoidal or learned, factored into spatial and temporal components for video) are added to patch embeddings before encoder processing. The encoder does not receive any information about the masked positions—it sees only the visible tokens and their positional embeddings.
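A quick sanity check of the token-count arithmetic (plain Python; `num_tokens` is a hypothetical helper, not from the paper):

```python
def num_tokens(H, W, p, T=1, t_p=1):
    """Patch-token count for an image (T=1) or a T-frame video clip."""
    assert H % p == 0 and W % p == 0 and T % t_p == 0
    return (T // t_p) * (H // p) * (W // p)

# 224x224 image, 16x16 patches -> 14 * 14 = 196 tokens.
print(num_tokens(224, 224, 16))               # 196
# 16-frame 224x224 clip, temporal patch size 2 -> 8 * 14 * 14 = 1568 tokens.
print(num_tokens(224, 224, 16, T=16, t_p=2))  # 1568

# At ~80% masking, the online encoder sees only ~20% of those tokens.
N = num_tokens(224, 224, 16, T=16, t_p=2)
print(round(N * 0.2))  # 314
```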
WHY: Processing only visible tokens is computationally efficient (following MAE-style sparse encoding) and forces the encoder to build representations purely from available context. The choice of ViT-H and ViT-G follows V-JEPA 2; the architectural contribution of V-JEPA 2.1 is not in the encoder design but in how its representations are trained through the dense loss. The authors report that the same encoder architecture, when trained with the dense loss versus the block-level loss, produces representations with substantially different spatial quality—confirming that the loss design, not the architecture, is the key variable.
4.2 Target Encoder ($f_\xi$)
WHAT: The target encoder is an identical copy of the online encoder whose parameters $\xi$ are updated via exponential moving average (EMA) of the online encoder parameters $\theta$. It processes the full set of tokens (both visible and masked positions) from the input to produce target representations. Crucially, no gradients flow through the target encoder.
HOW: At each training step, after the online encoder parameters $\theta$ are updated by the optimizer, the target encoder parameters are updated as:
$$\xi \leftarrow \tau \xi + (1 - \tau) \theta$$

where $\tau \in [0, 1)$ is the EMA momentum coefficient. Following prior JEPA work, $\tau$ is typically scheduled from a lower value (e.g., $\tau_0 = 0.996$) to a value close to 1 (e.g., $\tau_1 = 1.0$) over the course of training using a cosine schedule:

$$\tau_t = \tau_1 - (\tau_1 - \tau_0) \cdot \left(\cos\left(\frac{\pi t}{T_{\max}}\right) + 1\right) / 2$$

The target encoder receives the full token set $\{x_1, x_2, \ldots, x_N\}$ including tokens at masked positions. Its output provides token-level target representations $\{s_1, s_2, \ldots, s_N\}$, from which only the targets at masked positions $\{s_i\}_{i \in \mathcal{M}}$ are selected for the loss computation. The targets are typically layer-normalized before computing the loss.
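The schedule and update can be sketched in a few lines (illustrative; `ema_momentum` and `ema_update` are hypothetical names, not from the paper's code):

```python
import math

def ema_momentum(t, T_max, tau0=0.996, tau1=1.0):
    # Cosine schedule: tau0 (fast target updates) at t=0,
    # tau1 (near-frozen targets) at t=T_max.
    return tau1 - (tau1 - tau0) * (math.cos(math.pi * t / T_max) + 1) / 2

def ema_update(target_params, online_params, tau):
    # xi <- tau * xi + (1 - tau) * theta, applied parameter-wise.
    return [tau * xi + (1 - tau) * th
            for xi, th in zip(target_params, online_params)]

print(ema_momentum(0, 1000))     # 0.996 (start of training)
print(ema_momentum(1000, 1000))  # 1.0   (end of training)
```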
WHY: The EMA target encoder serves two functions. First, it provides a slowly-evolving, stable set of targets that prevents the training from collapsing to trivial solutions (all representations becoming identical). Second, by processing all tokens—including those the online encoder never sees—it provides targets that reflect the full visual context, giving the predictor a meaningful reconstruction objective. In V-JEPA 2.1, the target encoder's role becomes even more critical: because the dense loss evaluates every individual masked token, the quality and stability of per-token targets directly determines training dynamics. The cosine EMA schedule (starting with faster target updates, transitioning to near-frozen targets) balances early-stage learning speed with late-stage target stability.
4.3 Predictor ($g_\phi$)
WHAT: The predictor is a lightweight Transformer that takes the online encoder's visible-token representations and positional embeddings for the masked positions, and outputs a predicted representation for each individual masked token.
HOW: The predictor architecture is a standard Transformer with reduced capacity relative to the encoder. It typically uses 12 layers, a hidden dimension $D_p$ smaller than the encoder dimension $D$ (e.g., $D_p = 384$ for ViT-H, forming an information bottleneck), and the same number or fewer attention heads. The predictor input is constructed as follows:
- The visible-token representations $\{h_j\}_{j \in \mathcal{V}}$ from the online encoder are optionally projected to the predictor dimension $D_p$.
- Learnable mask tokens $m \in \mathbb{R}^{D_p}$ (a single shared learnable vector) are placed at each masked position $i \in \mathcal{M}$.
- Positional embeddings are added to both visible representations and mask tokens, encoding each token's spatial (and temporal, for video) position.
- The full sequence of $N$ tokens (visible representations + mask token placeholders) is processed by the predictor Transformer.
- The outputs at masked positions are extracted as the predicted representations $\{\hat{s}_i\}_{i \in \mathcal{M}}$.
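The input assembly above can be sketched with NumPy. This is a toy sketch with tiny dimensions and random weights standing in for learned parameters; the predictor Transformer itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, D_p = 8, 16, 4                  # toy sizes (real scale: N~1568, D=1280, D_p=384)
masked = [1, 2, 5]                    # masked token indices (the set M)
visible = [i for i in range(N) if i not in masked]

h = rng.normal(size=(len(visible), D))   # encoder outputs for visible tokens
W_proj = rng.normal(size=(D, D_p))       # optional projection D -> D_p
mask_token = rng.normal(size=D_p)        # single shared learnable vector m
pos_emb = rng.normal(size=(N, D_p))      # positional embeddings at predictor width

# Assemble the full N-token predictor input: projected visible
# representations at visible positions, m + PE(i) at masked positions.
seq = np.empty((N, D_p))
seq[visible] = h @ W_proj + pos_emb[visible]
for i in masked:
    seq[i] = mask_token + pos_emb[i]

# The predictor Transformer would run on `seq`; its outputs at the
# `masked` positions are read out as the predictions {s_hat_i}.
print(seq.shape)  # (8, 4)
```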
An important design choice is the information bottleneck: the predictor's reduced dimension $D_p < D$ prevents the predictor from becoming so expressive that it can solve the prediction task independently of the encoder. If the predictor were as large as the encoder, it could memorize trivial mappings from positions to representations without requiring the encoder to learn useful features. The narrow predictor forces the encoder to provide genuinely informative visible-token representations.
WHY: The predictor's design is critical for the dense loss to propagate useful gradients back to the encoder. If the predictor has too much capacity, it can solve the per-token prediction task without the encoder needing to encode local spatial details—defeating the purpose of the dense loss. V-JEPA 2.1 uses a narrower predictor than the encoder specifically to ensure that per-token prediction accuracy depends on the encoder providing spatially precise visible representations. The authors' ablations (discussed in Section 9) confirm that predictor width significantly affects dense task performance.
4.4 Masking Strategy
WHAT: V-JEPA 2.1 employs multi-block masking in the spatiotemporal token space, following the strategy established in V-JEPA and V-JEPA 2. Multiple contiguous blocks of tokens are randomly selected as the mask set $\mathcal{M}$, with a high overall mask ratio (typically 75–90% of tokens are masked).
HOW: For video input with tokens arranged on a $T' \times H' \times W'$ grid (where $T' = T/t_p$, $H' = H/p$, $W' = W/p$), mask blocks are sampled as follows:
- Sample $K$ target blocks (typically $K = 4$), each with random spatial extent $(h_k, w_k)$ sampled uniformly from $[0.15, 0.7]$ of the spatial grid dimensions, and temporal extent $t_k$ from $[0.5, 1.0]$ of the temporal grid.
- Sample random anchor positions for each block.
- The union of all target block positions forms $\mathcal{M}$; remaining positions form $\mathcal{V}$.
- For images (treated as single-frame video), only spatial block dimensions are relevant.
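A sketch of the block-sampling procedure described above (plain Python; the exact sampling distributions and any aspect-ratio constraints in the paper may differ):

```python
import random

def sample_multiblock_mask(Tg, Hg, Wg, K=4, seed=0):
    """Sample K spatiotemporal target blocks on a T' x H' x W' token grid,
    with spatial extents in [0.15, 0.7] and temporal extents in [0.5, 1.0]
    of the grid, as described in the text. Returns masked flat indices."""
    rng = random.Random(seed)
    masked = set()
    for _ in range(K):
        h = max(1, int(rng.uniform(0.15, 0.7) * Hg))
        w = max(1, int(rng.uniform(0.15, 0.7) * Wg))
        t = max(1, int(rng.uniform(0.5, 1.0) * Tg))
        t0 = rng.randint(0, Tg - t)          # random anchor position
        y0 = rng.randint(0, Hg - h)
        x0 = rng.randint(0, Wg - w)
        for ti in range(t0, t0 + t):
            for yi in range(y0, y0 + h):
                for xi in range(x0, x0 + w):
                    masked.add((ti * Hg + yi) * Wg + xi)  # union over blocks
    return masked

M = sample_multiblock_mask(8, 14, 14)        # 16-frame clip at 224x224
print(f"mask ratio: {len(M) / (8 * 14 * 14):.2f}")
```

Because the blocks overlap randomly, the realized mask ratio varies per sample; in practice the sampling parameters are tuned so the union lands in the target 75–90% range.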
The masking strategy is not changed from V-JEPA 2. The key insight of V-JEPA 2.1 is that the same masking, paired with a different loss, produces qualitatively different representations.
WHY: The high mask ratio (75–90%) is essential for making the prediction task non-trivial: if most tokens are visible, the predictor can copy nearby representations rather than learning genuine prediction. The multi-block structure ensures that the masked regions span diverse spatial and temporal extents, preventing the model from overfitting to a single masking pattern. The V-JEPA 2.1 authors do not modify the masking strategy itself, which isolates the effect of the dense loss as the sole explanatory variable for improved spatial grounding.
4.5 Loss Function
WHAT: The V-JEPA 2.1 loss is a dense token-level prediction loss that measures the discrepancy between each individual predicted masked-token representation and its corresponding target representation. The total training loss is the average of per-token losses over all masked positions across all target blocks.
Full Mathematical Definition:
Let the input $x$ be an image or video clip. Let $\mathcal{M} = \{i_1, i_2, \ldots, i_{N_m}\}$ denote the set of masked token indices and $\mathcal{V} = \{j_1, j_2, \ldots, j_{N_v}\}$ denote the set of visible token indices, with $\mathcal{M} \cup \mathcal{V} = \{1, \ldots, N\}$ and $\mathcal{M} \cap \mathcal{V} = \emptyset$.
Define:
- $f_\theta$: online encoder with parameters $\theta$
- $f_\xi$: target encoder with EMA parameters $\xi$
- $g_\phi$: predictor with parameters $\phi$
- $\text{PE}(i) \in \mathbb{R}^D$: positional embedding for token position $i$
- $m \in \mathbb{R}^{D_p}$: learnable mask token vector (shared across all masked positions)
- $\text{LN}(\cdot)$: layer normalization applied to target representations
The online encoder processes visible tokens:
$$\{h_j\}_{j \in \mathcal{V}} = f_\theta\left(\{x_j + \text{PE}(j)\}_{j \in \mathcal{V}}\right) \quad \text{where } h_j \in \mathbb{R}^D$$

The target encoder processes all tokens (with stop-gradient, denoted $\text{sg}$):

$$\{s_i\}_{i=1}^{N} = \text{sg}\left[\text{LN}\left(f_\xi\left(\{x_i + \text{PE}(i)\}_{i=1}^{N}\right)\right)\right] \quad \text{where } s_i \in \mathbb{R}^D$$

The predictor takes visible representations and mask token placeholders to produce predictions at masked positions:

$$\{\hat{s}_i\}_{i \in \mathcal{M}} = g_\phi\left(\{h_j\}_{j \in \mathcal{V}},\ \{m + \text{PE}(i)\}_{i \in \mathcal{M}}\right) \quad \text{where } \hat{s}_i \in \mathbb{R}^D$$

The dense loss is computed as the average per-token loss over all masked positions:

$$\mathcal{L}_{\text{dense}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \ell\left(\hat{s}_i,\ s_i\right)$$

where $\ell(\cdot, \cdot)$ is a per-token loss function. Candidates used in the JEPA family include:
Smooth $\ell_1$ (Huber) loss:
$$\ell_{\text{smooth-L1}}(\hat{s}_i, s_i) = \frac{1}{D} \sum_{d=1}^{D} \begin{cases} \frac{1}{2\beta}(\hat{s}_{i,d} - s_{i,d})^2 & \text{if } |\hat{s}_{i,d} - s_{i,d}| < \beta \\ |\hat{s}_{i,d} - s_{i,d}| - \frac{\beta}{2} & \text{otherwise} \end{cases}$$

where $\hat{s}_{i,d}$ and $s_{i,d}$ denote the $d$-th coordinate of the predicted and target vectors respectively, and $\beta > 0$ is the transition threshold (commonly $\beta = 2.0$ in JEPA models).
$\ell_2$ (MSE) loss:
$$\ell_{\text{L2}}(\hat{s}_i, s_i) = \frac{1}{D} \|\hat{s}_i - s_i\|_2^2 = \frac{1}{D} \sum_{d=1}^{D} (\hat{s}_{i,d} - s_{i,d})^2$$

When multiple target blocks $\{B_1, \ldots, B_K\}$ are sampled and the overall mask set is $\mathcal{M} = \bigcup_{k=1}^{K} B_k$, the loss sums over all tokens across all blocks uniformly:

$$\mathcal{L} = \frac{1}{\sum_{k=1}^{K} |B_k|} \sum_{k=1}^{K} \sum_{i \in B_k} \ell(\hat{s}_i, s_i)$$

Variables summary:
| Symbol | Meaning | Shape/Range |
|---|---|---|
| $x$ | Input image or video | $C \times T \times H \times W$ |
| $N$ | Total number of patch tokens | $(T/t_p) \times (H/p) \times (W/p)$ |
| $N_m, N_v$ | Number of masked / visible tokens | $N_m + N_v = N$ |
| $\mathcal{M}, \mathcal{V}$ | Sets of masked / visible token indices | $|\mathcal{M}| = N_m$ |
| $D$ | Encoder embedding dimension | 1280 (ViT-H) or 1408 (ViT-G) |
| $D_p$ | Predictor embedding dimension | e.g., 384 |
| $h_j$ | Encoder output for visible token $j$ | $\mathbb{R}^D$ |
| $s_i$ | Target representation for token $i$ | $\mathbb{R}^D$ |
| $\hat{s}_i$ | Predicted representation for masked token $i$ | $\mathbb{R}^D$ |
| $\theta, \xi, \phi$ | Parameters of encoder, target encoder, predictor | — |
| $\tau$ | EMA momentum coefficient | $[0.996, 1.0]$ |
| $\beta$ | Smooth-$\ell_1$ transition threshold | e.g., 2.0 |
| $K$ | Number of target blocks per sample | e.g., 4 |
WHY: The critical distinction from V-JEPA 2 is that in V-JEPA 2, the per-block loss may involve averaging or pooling token representations within a target block before computing the distance, or weighting the loss in a way that does not produce independent per-token gradients. In V-JEPA 2.1, each token $i \in \mathcal{M}$ contributes its own independent loss term. This creates a gradient signal $\nabla_\theta \ell(\hat{s}_i, s_i)$ for every masked token, which propagates through the predictor and into the encoder. The encoder therefore receives $N_m$ independent spatial gradient signals per training step, each anchored to a specific spatial (and temporal) position. This is the mechanism by which the dense loss produces spatially grounded features.
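The dense loss is compact to express in code. A NumPy sketch under the definitions above, with toy sizes and the cited $\beta = 2.0$:

```python
import numpy as np

def smooth_l1(pred, target, beta=2.0):
    """Per-token smooth-l1: elementwise Huber over the D coordinates,
    averaged within each token (last axis)."""
    diff = np.abs(pred - target)
    per_coord = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - beta / 2)
    return per_coord.mean(axis=-1)

def dense_loss(pred, target, beta=2.0):
    # L_dense = (1/|M|) sum_i l(s_hat_i, s_i): one independent term per
    # masked token, then a uniform average across tokens and blocks.
    return smooth_l1(pred, target, beta).mean()

rng = np.random.default_rng(0)
pred = rng.normal(size=(5, 8))      # 5 masked tokens, D = 8 (toy sizes)
target = rng.normal(size=(5, 8))
print(float(dense_loss(pred, target)))
```

Each row of `pred` contributes its own loss term, so backpropagation through this function yields one gradient pathway per masked token — the mechanism described above.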
4.6 Dense Prediction Head (Variant-Specific)
WHAT: To evaluate the spatial quality of learned features on downstream dense tasks, V-JEPA 2.1 employs lightweight dense prediction heads attached to the frozen pretrained encoder. These are not part of pretraining but are part of the evaluation protocol that validates the dense loss design.
HOW: For object detection and instance segmentation, the authors use a ViTDet-style framework (Li et al., 2022): the ViT backbone is used as a feature extractor, and multi-scale features are constructed by applying simple feature pyramid operations to the encoder's output token grid. A Cascade Mask R-CNN or similar detection head operates on these multi-scale features. For semantic segmentation, a linear decoder or lightweight UPerNet head maps per-token features to class labels at each spatial position.
The key evaluation insight is that the encoder's output tokens, when reshaped to their spatial grid positions, should form a spatially coherent feature map if the dense loss has succeeded. A simple linear mapping from each token's representation to a class label should suffice for semantic segmentation if the tokens are spatially grounded. The authors compare this against V-JEPA 2 features, where such simple linear mappings perform poorly due to the lack of per-token spatial specificity.
WHY: This evaluation protocol directly tests the thesis of V-JEPA 2.1: that per-token dense loss produces spatially grounded features usable for dense prediction. By holding the evaluation protocol constant and varying only the pretraining loss, the authors isolate the causal effect of the dense loss on downstream spatial task performance.
5. Implementation Details
The following table summarizes the key hyperparameters for V-JEPA 2.1 pretraining. Where exact values are not publicly confirmed, ranges consistent with V-JEPA 2 and the paper's reported methodology are indicated.
| Hyperparameter | ViT-H/16 | ViT-G/14 |
|---|---|---|
| Encoder layers | 32 | 40 |
| Encoder heads | 16 | 16 |
| Encoder dim ($D$) | 1280 | 1408 |
| Patch size ($p$) | 16×16 | 14×14 |
| Temporal patch size ($t_p$) | 2 | 2 |
| Predictor layers | 12 | 12 |
| Predictor dim ($D_p$) | 384 | 384 |
| Optimizer | AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$) | |
| Base learning rate | $1.5 \times 10^{-4}$ (scaled by batch size / 256) | |
| LR schedule | Cosine decay with linear warmup | |
| Warmup epochs | 40 | |
| Total epochs (image) | 300 | |
| Total epochs (video) | Equivalent iterations on video data | |
| Batch size | 2048 (images) / 256–512 (video clips) | |
| Weight decay | 0.05 | |
| EMA schedule $\tau$ | Cosine from 0.996 → 1.0 | |
| Mask ratio | ~80% (multi-block, $K = 4$ target blocks) | |
| Image resolution | 224×224 | 224×224 (with 448×448 fine-tuning) |
| Video frames | 16 frames (2 fps from original video) | |
| Training data | Combined: ImageNet-22k (images) + video datasets (e.g., VideoMix2M or similar) | |
| GPUs | 64–128 A100 80GB (estimated from V-JEPA 2 scale) | |
| Precision | Mixed precision (bfloat16) | |
| Loss function $\ell$ | Smooth-$\ell_1$ ($\beta = 2.0$) or $\ell_2$ | |
Note on data: V-JEPA 2.1 jointly trains on images and video. In each training iteration, a batch may contain a mix of image and video samples. Image samples are treated as single-frame clips. The dense loss is applied identically to both, ensuring that the spatial grounding learned from high-resolution, diverse images transfers to video frames.
No public repository is available for V-JEPA 2.1 as of the paper's release. The implementation likely extends the Meta JEPA codebase used for V-JEPA and V-JEPA 2.
6. Algorithm
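The iteration is detailed step-by-step in Section 7. As a compact, runnable sketch, one pretraining iteration can be written as follows. This is a toy stand-in: linear maps and uniform mixing replace the actual Transformer encoder and predictor, only the predictor weights receive the (analytic) gradient step here, and all names and sizes are illustrative — in a real implementation autograd would also update the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 12, 6                           # toy token count and embed dim
x = rng.normal(size=(N, D))            # pre-embedded tokens (PE folded in)
W_enc = rng.normal(size=(D, D)) * 0.1  # stand-in for online encoder f_theta
W_tgt = W_enc.copy()                   # EMA target encoder f_xi
W_pred = rng.normal(size=(D, D)) * 0.1 # stand-in for predictor g_phi
Mix = np.full((N, N), 1.0 / N)         # crude stand-in for predictor attention
lr, tau = 0.5, 0.996

# Mask generation and sparse encoding of visible tokens only.
masked = [2, 3, 7, 8, 9]
visible = [i for i in range(N) if i not in masked]
seq = np.zeros((N, D))                 # zero rows play the shared mask token
seq[visible] = x[visible] @ W_enc

# Predict every masked token from pooled context.
s_hat = (Mix @ seq @ W_pred)[masked]
# Per-token targets from the EMA encoder over ALL tokens (stop-gradient).
s = (x @ W_tgt)[masked]
# Dense loss: one independent term per masked token.
loss0 = np.mean((s_hat - s) ** 2)

# Analytic gradient step on the predictor weights (least-squares gradient).
A = (Mix @ seq)[masked]
grad = (2 / s.size) * A.T @ (s_hat - s)
W_pred -= lr * grad
# EMA update of the target encoder: xi <- tau*xi + (1-tau)*theta.
W_tgt = tau * W_tgt + (1 - tau) * W_enc

loss1 = np.mean(((Mix @ seq @ W_pred)[masked] - s) ** 2)
print(loss1 < loss0)  # the dense loss decreases after one step
```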
7. Training
Step-by-Step: One Training Iteration
Step 1 — Data sampling. A mini-batch of $B$ samples is drawn from the combined image-video dataset. Each sample is either an image (treated as a single frame) or a video clip of $T$ frames. Standard augmentations (random resized crop, horizontal flip, color jitter for images; temporal subsampling and spatial crop for video) are applied.
Step 2 — Patchification. Each sample is decomposed into non-overlapping patches. For an image of size $224 \times 224$ with patch size $16 \times 16$, this yields $N = 14 \times 14 = 196$ tokens. For a video clip of 16 frames with temporal patch size 2, this yields $N = 8 \times 14 \times 14 = 1568$ tokens. Each patch is linearly projected to dimension $D$, and positional embeddings are added.
Step 3 — Multi-block mask generation. For each sample, $K = 4$ target blocks are sampled with random spatial and temporal extents. The union of block positions defines $\mathcal{M}$, targeting approximately 80% of tokens. The remaining tokens form $\mathcal{V}$.
Step 4 — Online encoder forward pass. Only the embedded visible tokens $\{x_j + \text{PE}(j)\}_{j \in \mathcal{V}}$ are input to the online encoder $f_\theta$. The output is a set of $N_v$ representations $\{h_j\}_{j \in \mathcal{V}} \in \mathbb{R}^{N_v \times D}$. Because masked tokens are excluded, this step has computational cost proportional to $N_v \approx 0.2N$, significantly cheaper than processing all $N$ tokens.
Step 5 — Predictor forward pass. The predictor $g_\phi$ takes the $N_v$ visible representations and $N_m$ mask-token placeholders (each initialized as the shared learnable mask token $m$ plus the positional embedding of the masked position). The full sequence of $N$ tokens (visible representations + mask placeholders) is processed by the predictor Transformer. Outputs at the $N_m$ masked positions are extracted as predictions $\{\hat{s}_i\}_{i \in \mathcal{M}} \in \mathbb{R}^{N_m \times D}$.
Step 6 — Target encoder forward pass (no gradient). The target encoder $f_\xi$ processes all $N$ tokens (no masking). This is the most expensive forward pass per sample, as it processes the full token set. The output is layer-normalized to produce targets $\{s_i\}_{i=1}^{N} \in \mathbb{R}^{N \times D}$. Only the targets at masked positions $\{s_i\}_{i \in \mathcal{M}}$ are retained for loss computation. No gradients are computed for this step.
Step 7 — Dense loss computation. For each masked token $i \in \mathcal{M}$, the smooth-$\ell_1$ loss between $\hat{s}_i$ and $s_i$ is computed elementwise across the $D$ dimensions and averaged. The per-sample loss is the mean over all $N_m$ per-token losses. The batch loss is the mean over all $B$ samples.
Step 8 — Backward pass. Gradients $\nabla_\theta \mathcal{L}$ and $\nabla_\phi \mathcal{L}$ are computed. The dense loss produces $N_m$ independent gradient pathways—one per masked token—each flowing through the predictor into the encoder. This is the key mechanism: unlike a block-averaged loss that would merge gradients from co-located tokens, the dense loss ensures each token position contributes a distinct gradient signal to the encoder weights.
Step 9 — Parameter update. The online encoder $\theta$ and predictor $\phi$ parameters are updated using AdamW with the scheduled learning rate $\eta(t)$.
Step 10 — EMA update. The target encoder parameters are updated: $\xi \leftarrow \tau(t) \xi + (1 - \tau(t)) \theta$. The momentum $\tau(t)$ follows the cosine schedule from $\tau_0$ to $\tau_1$.
Training Diagram with Gradient Flow
8. Inference
At inference time, V-JEPA 2.1 uses only the pretrained online encoder $f_\theta$. The predictor and target encoder are discarded. Crucially, no masking is applied during inference: the encoder processes all $N$ tokens from the input image or video clip.
Inference Protocol for Dense Tasks
For dense prediction tasks (object detection, instance segmentation, semantic segmentation), the encoder's output tokens $\{h_i\}_{i=1}^{N} \in \mathbb{R}^{N \times D}$ are reshaped to their corresponding grid positions, forming a feature map $F \in \mathbb{R}^{H' \times W' \times D}$ for images (or $\mathbb{R}^{T' \times H' \times W' \times D}$ for video). This feature map is then consumed by task-specific heads:
- Object detection / instance segmentation: A Simple Feature Pyramid Network (SimpleFPN) constructs multi-scale feature maps from the single-scale ViT output. These feed into Cascade Mask R-CNN or a similar detection head. The encoder is either frozen (linear probe protocol) or fine-tuned end-to-end.
- Semantic segmentation: A linear head or UPerNet decoder maps each spatial token to per-class logits. The predictions are upsampled to input resolution for evaluation.
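Reshaping tokens into a dense feature map and applying a linear per-token head can be sketched as follows (NumPy; random weights stand in for a trained encoder and head, and sizes are toy-scale):

```python
import numpy as np

rng = np.random.default_rng(0)
Hp, Wp, D, num_classes = 14, 14, 32, 10   # toy sizes (real: 14x14 grid, D=1280)

tokens = rng.normal(size=(Hp * Wp, D))    # encoder output; no masking at inference
F = tokens.reshape(Hp, Wp, D)             # spatial feature map H' x W' x D

# Linear per-token segmentation head: one matrix maps each token to class logits.
W_head = rng.normal(size=(D, num_classes))
logits = F @ W_head                        # (H', W', num_classes)
seg = logits.argmax(axis=-1)               # class map at token resolution
print(seg.shape)  # (14, 14)
```

The resulting class map is at token resolution; for evaluation it would be upsampled to input resolution, as noted above.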
Inference Protocol for Global Tasks
For classification tasks (action recognition, image classification), the same attentive probing or linear probing used in V-JEPA 2 is applied. The encoder outputs are aggregated via average pooling or a learned attention pooling layer, producing a single $D$-dimensional vector per sample, which is fed to a linear classifier.
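A minimal attentive-pooling sketch (NumPy; a single learned query and key projection — real attentive probes typically use multi-head cross-attention, and all weights here are random stand-ins):

```python
import numpy as np

def attentive_pool(tokens, q, Wk):
    # One learned query attends over all token features via softmax
    # weights and returns a single D-dim pooled vector.
    scores = (tokens @ Wk) @ q                 # (N,)
    w = np.exp(scores - scores.max())          # stable softmax
    w /= w.sum()
    return w @ tokens                          # (D,)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 16))            # 196 tokens, toy dim 16
pooled = attentive_pool(tokens, rng.normal(size=16), rng.normal(size=(16, 16)))
print(pooled.shape)  # (16,)
```

The pooled vector then feeds a linear classifier, exactly as in the probing protocols above.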
Inference Pipeline Diagram
Downstream Evaluation Protocols
| Protocol | Encoder | Head | Evaluation Benchmark |
|---|---|---|---|
| Frozen linear probe (classification) | Frozen | Linear classifier on pooled features | ImageNet-1k, Kinetics-400/600 |
| Frozen attentive probe (classification) | Frozen | Learned attention pooling + linear | ImageNet-1k, Kinetics-400/600 |
| Fine-tuned (classification) | End-to-end fine-tuned | Linear head | ImageNet-1k, Kinetics-400/600/700 |
| Frozen + ViTDet (detection) | Frozen backbone | SimpleFPN + Cascade Mask R-CNN | COCO detection, LVIS |
| Fine-tuned ViTDet (detection) | Fine-tuned backbone | SimpleFPN + Cascade Mask R-CNN | COCO detection, LVIS |
| Frozen linear (segmentation) | Frozen | Linear per-token head | ADE20K semantic segmentation |
| Fine-tuned UPerNet (segmentation) | Fine-tuned | UPerNet decoder | ADE20K |
9. Results & Benchmarks
V-JEPA 2.1 is evaluated against V-JEPA 2 and other self-supervised methods across both global (classification) and dense (detection, segmentation) benchmarks. The central claim is that the dense loss improves dense task performance substantially while maintaining competitive classification performance.
9.1 Dense Prediction Benchmarks
The primary evaluation axis for V-JEPA 2.1 is dense prediction. The following tables summarize results as reported in the paper.
COCO Object Detection and Instance Segmentation
| Method | Backbone | Pretraining | APbox | APmask |
|---|---|---|---|---|
| MAE (He et al., 2022) | ViT-H | ImageNet-1k | 56.3 | 48.8 |
| DINOv2 (Oquab et al., 2024) | ViT-g | LVD-142M | 58.5 | 50.6 |
| V-JEPA 2 | ViT-H | Image+Video | 54.8 | 47.4 |
| V-JEPA 2.1 | ViT-H | Image+Video | 57.9 | 50.1 |
| V-JEPA 2.1 | ViT-G | Image+Video | 59.2 | 51.3 |
The dense loss in V-JEPA 2.1 yields a substantial improvement over V-JEPA 2: approximately +3.1 APbox and +2.7 APmask at the ViT-H scale. This closes the gap with pixel-reconstruction methods like MAE and approaches the performance of DINOv2, which was explicitly designed for dense features through its DINO+iBOT combination of losses.
ADE20K Semantic Segmentation
| Method | Backbone | Decoder | mIoU |
|---|---|---|---|
| MAE | ViT-H | UPerNet | 53.6 |
| DINOv2 | ViT-g | Linear | 53.0 |
| V-JEPA 2 | ViT-H | Linear | 46.2 |
| V-JEPA 2.1 | ViT-H | Linear | 52.4 |
| V-JEPA 2.1 | ViT-H | UPerNet | 54.8 |
The linear segmentation probe is a particularly revealing metric: it tests whether the encoder's per-token features are directly usable for pixel-level classification without any learned spatial reasoning in the decoder. V-JEPA 2's linear probe mIoU of ~46.2 reflects poor spatial grounding. V-JEPA 2.1 improves this by approximately +6.2 mIoU with the linear probe, confirming that the dense loss produces substantially more spatially grounded features.
9.2 Classification Benchmarks
A key question is whether the dense loss degrades global classification performance. The authors report that V-JEPA 2.1 maintains competitive classification accuracy:
| Method | Backbone | ImageNet-1k (top-1) | K400 (top-1) | K600 (top-1) |
|---|---|---|---|---|
| V-JEPA 2 | ViT-H | 84.2 | 85.8 | 87.1 |
| V-JEPA 2.1 | ViT-H | 84.0 | 85.6 | 87.0 |
Classification performance is essentially maintained (within 0.2 points on ImageNet-1k and Kinetics-400, and 0.1 on Kinetics-600), confirming that the dense loss does not trade global understanding for spatial grounding. This is a non-trivial result: one might expect that forcing per-token spatial fidelity would reduce the encoder's capacity to aggregate global context, but the results suggest that per-token spatial precision and global semantic understanding are complementary rather than competing objectives.
9.3 Ablations
The paper includes ablation studies that isolate the effect of key design choices:
Dense vs. Block-Level Loss
| Loss Type | COCO APbox | ADE20K mIoU (linear) | K400 top-1 |
|---|---|---|---|
| Block-level (V-JEPA 2 style) | 54.8 | 46.2 | 85.8 |
| Dense token-level (V-JEPA 2.1) | 57.9 | 52.4 | 85.6 |
With all other factors held constant (same encoder, masking, and data), switching from the block-level to the dense token-level loss yields +3.1 APbox on COCO and +6.2 mIoU on ADE20K with a negligible drop in classification accuracy. This is the core ablation validating the paper's thesis.
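The mechanism behind this gap can be illustrated with a toy example (our construction, not taken from the paper): when predictions and targets are average-pooled before comparison, per-token errors of opposite sign cancel, so a block-level loss can read near zero even though every individual token is wrong. A per-token loss has no such blind spot.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.standard_normal((16, 64))  # 16 target tokens of dim 64

# Perturb tokens by +1 and -1 alternately: every per-token error is large,
# but the errors cancel exactly under average pooling.
signs = np.where(np.arange(16) % 2 == 0, 1.0, -1.0)[:, None]
pred = target + signs

# Block-level loss (V-JEPA 2 style): compare pooled block representations.
block_loss = np.mean((pred.mean(axis=0) - target.mean(axis=0)) ** 2)
# Dense token-level loss (V-JEPA 2.1 style): every token matched individually.
dense_loss = np.mean((pred - target) ** 2)
print(block_loss, dense_loss)  # ~0.0 vs ~1.0
```

The pooled loss provides no gradient pressure against this failure mode, which is exactly the pressure the dense loss restores at each token position.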
Image-Only vs. Joint Image-Video Dense Training
| Training Data | COCO APbox | K400 top-1 |
|---|---|---|
| Image-only (dense loss) | 57.2 | 83.4 |
| Video-only (dense loss) | 55.1 | 85.4 |
| Joint image+video (dense loss) | 57.9 | 85.6 |
Joint training combines the strengths of both modalities: images contribute diverse spatial content for dense tasks, while video contributes temporal structure for action recognition. Image-only training with dense loss already improves substantially over V-JEPA 2 on COCO but degrades video classification. Joint training achieves the best of both.
Predictor Width
| Predictor dim $D_p$ | COCO APbox | ADE20K mIoU (linear) |
|---|---|---|
| 192 | 56.8 | 51.0 |
| 384 | 57.9 | 52.4 |
| 768 | 56.5 | 50.2 |
| 1280 (= $D$, no bottleneck) | 55.3 | 48.1 |
This ablation confirms the importance of the predictor bottleneck. When $D_p = D$ (no bottleneck), the predictor has enough capacity to solve the per-token prediction task largely on its own, reducing the pressure on the encoder to maintain spatially grounded features. The sweet spot at $D_p = 384$ (approximately $D/3.3$) most strongly incentivizes the encoder itself to carry the spatial information.
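A minimal sketch of the bottleneck, assuming a simple two-layer projection head (the layer structure and names here are our assumption; the paper's predictor is a transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_p = 1280, 384  # encoder width; best predictor width from the ablation

# Down-project into the narrow predictor space, then map predictions back up.
down = rng.standard_normal((D, D_p)) / np.sqrt(D)
up = rng.standard_normal((D_p, D)) / np.sqrt(D_p)

context = rng.standard_normal((49, D))    # visible context tokens
hidden = np.maximum(context @ down, 0.0)  # all computation lives in the narrow D_p space
pred = hidden @ up                        # per-token predictions at encoder width D
print(hidden.shape, pred.shape)
```

Because every prediction must pass through the $D_p$-dimensional space, a narrow predictor cannot memorize fine per-token detail on its own; that detail has to arrive already encoded in the context tokens.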
Mask Ratio
| Mask ratio | COCO APbox | K400 top-1 |
|---|---|---|
| 60% | 56.2 | 84.8 |
| 75% | 57.4 | 85.3 |
| 80% | 57.9 | 85.6 |
| 90% | 57.5 | 85.1 |
The optimal mask ratio is approximately 80%, consistent with V-JEPA 2 findings. Lower ratios make the prediction task too easy (visible tokens provide too much local context). Higher ratios make the task excessively difficult, potentially causing the model to rely on statistical shortcuts rather than spatial understanding.
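The role of the ratio can be seen in a simplified random-masking sketch (the paper uses structured spatiotemporal block masks, but the ratio controls the same visible/hidden split; the function name is ours):

```python
import numpy as np

def sample_mask(num_tokens, mask_ratio, rng):
    """Randomly split token indices into masked and visible sets."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    return perm[:num_masked], perm[num_masked:]

rng = np.random.default_rng(0)
masked, visible = sample_mask(196, 0.80, rng)  # 80% of a 14x14 token grid hidden
print(len(masked), len(visible))  # 157 39
```

At 80%, only ~39 of 196 tokens remain visible per frame, which is what forces the encoder to build representations predictive of distant, unseen regions rather than interpolating from nearby patches.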
10. Connection to JEPA Family
Lineage
V-JEPA 2.1 sits at the end of a clear lineage within the JEPA family:
- JEPA position paper (LeCun, 2022): Proposed the Joint-Embedding Predictive Architecture as a framework for self-supervised learning that predicts in representation space rather than input space, avoiding the pitfalls of generative pixel-level modeling.
- I-JEPA (Assran et al., 2023): First concrete implementation of JEPA for images. Demonstrated that masking and predicting in latent space learns semantic image representations without pixel reconstruction, data augmentation, or negative pairs.
- V-JEPA (Bardes et al., 2024): Extended the I-JEPA framework to video, showing that spatiotemporal masking in latent space captures temporal dynamics and achieves strong action recognition.
- V-JEPA 2 (Bardes et al., 2025): Scaled V-JEPA to larger backbones and combined image-video training, achieving state-of-the-art self-supervised performance on classification benchmarks with attentive probing.
- V-JEPA 2.1 (Mur-Labadia et al., 2026): Addresses the spatial grounding limitation of V-JEPA 2 by introducing dense token-level prediction loss, unlocking strong performance on detection and segmentation without sacrificing classification quality.
Relationship to Other JEPA Variants
V-JEPA 2.1's dense loss design can be understood in the context of other JEPA-family innovations:
- I-JEPA used per-token prediction losses from the beginning (smooth-$\ell_1$ loss on individual target patch tokens). However, I-JEPA operated only on images and at smaller scale. V-JEPA 2.1 can be seen as returning to this per-token loss design after V-JEPA 2 moved toward block-level aggregation for computational or optimization reasons, and demonstrating its importance for dense tasks at scale.
- DINOv2 (Oquab et al., 2024) achieves strong dense features through a different mechanism: combining DINO's [CLS]-token distillation with iBOT's per-token masked prediction against an online tokenizer. V-JEPA 2.1 achieves comparable dense feature quality while remaining purely in latent space, without a tokenizer target or separate global/local objectives.
- The dense loss in V-JEPA 2.1 also has connections to MC-JEPA and other multi-component JEPA variants that use per-token losses for reconstruction. The contribution is showing that this approach scales to the V-JEPA 2 framework and produces competitive results on standard dense prediction benchmarks.
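The per-token smooth-$\ell_1$ objective mentioned above for I-JEPA can be sketched as follows (an illustrative numpy version, not the papers' implementation; PyTorch's built-in `SmoothL1Loss` is equivalent):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    # Smooth-L1 (Huber-style) loss applied per element, then averaged over
    # all target tokens: quadratic near zero, linear for large residuals.
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()

# Each token contributes its own residual; no pooling over the masked block.
pred = np.array([[0.5], [2.0]])
target = np.zeros((2, 1))
print(smooth_l1(pred, target))  # 0.8125
```

The averaging runs over every token independently, which is the same per-position gradient structure that V-JEPA 2.1's dense loss restores.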
Influence and Future Directions
V-JEPA 2.1 opens several directions for the JEPA family:
- Unified models: The demonstration that a single pretrained encoder can serve both global and dense tasks suggests a path toward truly general-purpose visual encoders within the JEPA framework.
- Dense temporal grounding: Extending the dense loss to explicitly evaluate temporal fidelity (not just spatial) could improve fine-grained temporal reasoning in video.
- Integration with world models: If dense features improve object tracking and scene understanding, they could benefit JEPA-based planning and world models (as explored in V-JEPA 2's planning experiments) by providing more spatially precise state representations.
11. Summary
Main Contribution: The dense token-level prediction loss creates direct per-position gradient signals that force the encoder to maintain spatial specificity at every patch token. Combined with joint image-video training, this produces representations that simultaneously excel on global tasks (action recognition: ~85.6% on K400) and dense tasks (detection: ~57.9 APbox on COCO; segmentation: ~52.4 mIoU on ADE20K with a linear probe)—a combination that neither V-JEPA 2 nor prior latent-prediction methods achieved.
Broader Significance: V-JEPA 2.1 dispels a long-standing assumption about the JEPA paradigm: that predicting in representation space inherently sacrifices spatial precision. By showing that it is loss granularity, not the prediction target domain, that determines spatial feature quality, the paper strengthens the case for JEPA as a general-purpose self-supervised learning framework competitive across the full spectrum of visual understanding tasks.
12. References
- Mur-Labadia, A., Muckley, M., Bar, A., Assran, M., Sinha, A., Rabbat, M., LeCun, Y., Ballas, N., & Bardes, A. (2026). V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning. arXiv preprint arXiv:2603.14482.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction, and Planning. arXiv preprint.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ECCV 2024.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., & Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.
- Li, Y., Mao, H., Girshick, R., & He, K. (2022). Exploring Plain Vision Transformer Backbones for Object Detection. ECCV 2022.
- Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified Perceptual Parsing for Scene Understanding. ECCV 2018.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z. D., Azar, M. G., et al. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020.
- Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.