Authors: Bardes, Garrido, Ponce, Chen, Rabbat, LeCun, Assran, Ballas
Date: 2024-04
Category: Video
Derives from: I-JEPA
Score: 6.6/10 — Draft

Video JEPA (V-JEPA)

1. Introduction

Self-supervised learning from video carries a fundamental promise: video provides a dense, naturally occurring signal about how the visual world changes over time, and any system that can predict future visual states from past observations must, implicitly, understand object permanence, motion dynamics, and scene semantics. Yet the dominant approaches to self-supervised video representation learning have struggled to deliver on this promise. Pixel-reconstruction methods such as VideoMAE (Tong et al., 2022) and MAE-ST (Feichtenhofer et al., 2022) learn by reconstructing masked spatiotemporal patches in raw pixel space, forcing the encoder to waste capacity on unpredictable low-level details — the exact texture of a shirt, the noise in a shadow, the precise value of every RGB pixel. Contrastive methods such as BEVT and ViCLR learn temporal invariances through augmentation-driven objectives, but require careful design of temporal and spatial augmentation pipelines that inevitably inject inductive biases about what should and should not be invariant. Language-supervised methods like VideoCLIP and InternVideo achieve strong zero-shot performance but are fundamentally dependent on paired text data, limiting them to domains where text supervision exists and biasing representations toward concepts expressible in language.

V-JEPA (Video Joint-Embedding Predictive Architecture), introduced by Bardes, Garrido, Ponce, Chen, Rabbat, LeCun, Assran, and Ballas in February 2024, resolves all three limitations by extending the JEPA paradigm from static images to video. V-JEPA learns by predicting the abstract representations of masked spatiotemporal regions in a learned latent space, using no pixel reconstruction, no pretrained image encoders, no text supervision, no negative examples, and no human annotations of any kind. This single principle — predict features, not pixels — yields representations that capture both appearance and motion, work across video and image tasks, and do so with substantially less compute than pixel-reconstruction alternatives.

V-JEPA directly extends I-JEPA (Assran et al., 2023), the image-based instantiation of the JEPA framework. However, the transition from images to video introduces several non-trivial challenges that V-JEPA addresses through three key design decisions:

  1. Spatiotemporal multi-block masking: I-JEPA masks 2D spatial blocks on a flat patch grid. V-JEPA operates on a 3D spatiotemporal token grid and employs a masking strategy where target blocks span the full temporal extent of the clip — every frame — but cover only a fraction of the spatial area. This design forces the predictor to reason about spatial relationships that persist across time, rather than trivially interpolating from nearby frames.
  2. Video Vision Transformer with tubelet embedding: V-JEPA uses a 3D patch embedding (tubelet size 2) that fuses pairs of consecutive frames at the input, producing a compact spatiotemporal token sequence. Combined with 3D sinusoidal positional embeddings and standard transformer blocks, this yields an architecture that processes video natively rather than treating it as a bag of independent frames.
  3. Frozen evaluation protocol: V-JEPA is evaluated primarily with the encoder weights completely frozen, using an attentive probing mechanism (a learnable cross-attention query attending to frozen features). This protocol is a stricter test of representation quality than fine-tuning, since the encoder cannot compensate for poor features by adapting to the downstream task.

The results are compelling. V-JEPA's largest model — a ViT-H/16 trained on 2 million publicly sourced videos — achieves 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K, all with a frozen backbone. On motion-heavy benchmarks like Something-Something-v2, V-JEPA outperforms image-pretrained models (DINOv2: 50.6%, OpenCLIP: 34.8%) by over 20 points, demonstrating that video pretraining captures temporal dynamics that image-only methods fundamentally miss. In label-efficient regimes (5% labeled data), V-JEPA ViT-H achieves 67.0% on K400 versus VideoMAE ViT-H's 62.3% — a gap that highlights the superior semantic quality of feature prediction over pixel reconstruction. Furthermore, V-JEPA processes only 270M total video samples across 90K training iterations, compared to VideoMAE's 410M samples in 400K iterations and Hiera's 770M in 1.5M iterations: a 1.5× to nearly 3× reduction in samples processed.

This article provides a complete treatment of V-JEPA. Section 2 explains the method with accessible intuitions. Section 3 gives a model overview with an at-a-glance table and architectural diagram. Section 4 dissects each component — encoder, target encoder, predictor, spatiotemporal masking, and loss function — with full mathematical formulations, hyperparameter details, and ablation evidence. Section 5 provides exhaustive implementation details. Section 6 presents formal algorithms for training and inference. Section 7 walks through one training iteration with annotated diagrams. Section 8 describes the inference and downstream evaluation pipeline. Section 9 reports benchmark results and ablation studies. Section 10 situates V-JEPA in the JEPA family lineage, and Section 11 summarizes key takeaways.

2. Method

The core idea of V-JEPA can be stated simply: given a short video clip, mask large spatiotemporal regions of the patch token grid, encode only the visible tokens, and predict the representations of the masked tokens — as produced by a slowly-updating target encoder — without ever reconstructing pixels.

Think of it this way: Imagine watching a video of someone cooking through a window with several large opaque patches taped on the glass. These patches block different spatial regions but remain fixed across all frames — you can never see the countertop in the lower-left, for instance, throughout the entire clip. A pixel-reconstruction approach (VideoMAE) would ask you to paint in every missing pixel for every blocked region in every frame — the exact color of the countertop tile, the precise RGB of the steam, the reflection on the knife blade. V-JEPA instead asks: "describe what is happening in each blocked region across the clip, at an abstract level." You might say "the lower-left region contains a cutting board where a hand is slicing onions" — you understand the spatial content and the temporal dynamics without needing to reproduce any pixels. The key insight is that these masked regions span all frames, so you cannot simply "copy" a visible frame to fill them in; you must understand the scene's spatiotemporal structure.

More concretely, V-JEPA's training procedure for a single video clip involves five steps:

  1. Tubelet embedding: A video clip of $F$ frames at spatial resolution $H \times W$ is divided into non-overlapping 3D patches (tubelets) of size $t_s \times p \times p$ (temporal × height × width), where $t_s = 2$ and $p = 16$. Each tubelet is linearly projected to an embedding vector of dimension $D$. For a clip of 16 frames at 224×224, this yields $T \times H_g \times W_g = 8 \times 14 \times 14 = 1568$ tokens.
  2. Spatiotemporal masking: Two types of target masks are generated on the 3D token grid. "Short-range" masks consist of 8 small blocks, each covering 15% of the spatial area and spanning the full temporal dimension. "Long-range" masks consist of 2 large blocks, each covering 70% of the spatial area and again spanning full temporal extent. The union of all target blocks leaves approximately 10% of tokens as visible context.
  3. Context encoding: Only the visible context tokens (the ~10% not covered by any mask) are fed through the online context encoder — a Video Vision Transformer — producing patch-level representations.
  4. Latent prediction: A smaller predictor transformer receives the context representations and learnable mask tokens positioned at target locations (with appropriate 3D positional embeddings). It predicts the representations of all masked tokens.
  5. Loss computation: Simultaneously, the entire unmasked clip (all 1568 tokens) is passed through the target encoder — an exponential moving average (EMA) copy of the context encoder with stop-gradient — producing ground-truth representations for the target positions. The L1 distance between predicted and actual target representations is minimized.
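Step 1 above can be made concrete in a few lines of PyTorch. This is a minimal sketch using the shapes stated in the text (16 frames at 224×224, tubelet 2×16×16, embed dim 1024 for ViT-L); the layer name is illustrative:

```python
import torch
import torch.nn as nn

# Tubelet embedding as a strided Conv3d: kernel = stride = (2, 16, 16).
# For 16 frames at 224x224 this yields an 8 x 14 x 14 token grid (1568 tokens).
tubelet_embed = nn.Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))

clip = torch.randn(1, 3, 16, 224, 224)      # B x C x F x H x W
tokens = tubelet_embed(clip)                # 1 x 1024 x 8 x 14 x 14
tokens = tokens.flatten(2).transpose(1, 2)  # 1 x 1568 x 1024 token sequence
```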

Gradients flow only through the context encoder and predictor. The target encoder receives no gradient signal and is updated solely via EMA. This asymmetric design, combined with the information bottleneck of the narrow predictor, prevents representational collapse without requiring contrastive negatives, architectural tricks, or explicit regularization.

Why full-temporal-extent masks? A natural alternative would be to mask random spatiotemporal cubes (small in both space and time). V-JEPA's authors found that this makes prediction too easy: the model can reconstruct a briefly-masked region by interpolating from adjacent visible frames. By masking spatial regions across all frames in the clip, V-JEPA forces the predictor to reason about the spatial structure of the scene — what object occupies a given region — because temporal copying is impossible. The ablations confirm this: random tube masking yields only 51.5% on K400 compared to multi-block masking's 72.9%.
Feature prediction vs. pixel prediction: Predicting in latent space rather than pixel space has two concrete advantages. First, the target encoder implicitly denoises the prediction targets — it learns to produce representations that capture stable semantic content while discarding unpredictable low-level variation (lighting noise, compression artifacts, motion blur). The predictor can focus on what matters rather than wasting capacity on irrelevant detail. Second, since the target space is a compressed, learned representation rather than raw pixels, the prediction task is computationally cheaper — no pixel decoder is needed, and the loss computation is over D-dimensional vectors rather than $p^2 \times t_s \times 3$ pixel blocks. Ablations show feature prediction outperforms pixel reconstruction by 5.1 points on K400 (73.7% vs. 68.6%) and 1.5 points on ImageNet-1K (74.8% vs. 73.3%) at the ViT-L scale.

3. Model Overview

Architecture at a Glance

Input type: Video frames → 3D tubelet patches (16 frames, patch 16×16, tubelet 2)
Masking strategy: spatiotemporal multi-block (8 short-range blocks at 15% spatial, full temporal; 2 long-range blocks at 70% spatial, full temporal); ~90% of tokens masked
Encoder architecture: Video Vision Transformer (ViT-L/16 or ViT-H/16)
Predictor type: Narrow Vision Transformer (12 layers, 384-dim, 12 heads)
Loss function: L1 in representation space: $\frac{1}{|\mathcal{T}|}\sum_{i \in \mathcal{T}} \| \hat{s}_i - \text{sg}(s_i) \|_1$
Key result (frozen eval): 81.9% K400 · 72.2% SSv2 · 77.9% IN1K (ViT-H/16-384)
Parameters: ViT-L: ~307M encoder + ~38M predictor; ViT-H: ~632M encoder + ~47M predictor

Training Architecture Diagram

[Figure 1 diagram. Data flow: video clip (16 frames, 16×224×224×3, frame_step=4) → tubelet embed (2×16×16, N = 8×14×14 = 1568 tokens) → 3D multi-block masking (~90%) → context encoder (trainable Video ViT; ViT-H: 32 layers, D=1280, 16 heads) on ~157 visible tokens, and target encoder (EMA copy, stop-gradient, momentum 0.998→1.0) on all 1568 tokens → predictor (narrow ViT, 12 layers, D_p=384, 12 heads) with 2 mask tokens + 3D sinusoidal positional embeddings → L1 loss between predicted ŝ_y and LN(sg(s_y)) at target positions. Gradients flow only to the context encoder and predictor.]
Figure 1. V-JEPA training architecture. A video clip is embedded into 1568 spatiotemporal tokens. Approximately 90% are masked. The context encoder processes only visible tokens (~157). The predictor takes context representations plus learnable mask tokens and predicts target representations. The target encoder (EMA copy, no gradient) processes all tokens and provides L1 regression targets after layer normalization.

4. Main Components of V-JEPA

4.1 Encoder (Video Vision Transformer)

WHAT: The encoder is a standard Vision Transformer adapted for video through 3D patch embedding and 3D positional embeddings. It is the component that produces learned representations and whose weights are ultimately used for downstream tasks.

HOW: The encoder uses PatchEmbed3D — implemented as a single nn.Conv3d with kernel size $(t_s, p, p) = (2, 16, 16)$ and stride equal to kernel size — to project each spatiotemporal tubelet into a $D$-dimensional embedding. For ViT-L, $D = 1024$ with 24 transformer blocks and 16 attention heads. For ViT-H, $D = 1280$ with 32 blocks and 16 heads. Both use an MLP ratio of 4.0 and QKV bias. The architecture uses no [CLS] token — all computation operates on patch tokens only.

Positional embeddings are 3D sinusoidal (non-learnable), computed via get_3d_sincos_pos_embed with uniform_power=True, which allocates $\lceil D/6 \rceil \times 2$ dimensions per spatial/temporal axis and concatenates them (truncated to $D$). This allows interpolation to different spatial resolutions (e.g., 384) via trilinear interpolation.
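A simplified stand-in for get_3d_sincos_pos_embed can illustrate the construction (this sketch allocates dimensions per axis as described above but ignores the uniform_power details; function names are illustrative):

```python
import math
import torch

def sincos_1d(n, d):
    """n positions x d dims of standard sin/cos features (d even)."""
    omega = 1.0 / (10000 ** (torch.arange(d // 2).float() / (d // 2)))
    ang = torch.arange(n).float()[:, None] * omega[None, :]
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=1)

def sincos_3d(T, H, W, D):
    d = 2 * math.ceil(D / 6)  # dims allocated per temporal/spatial axis
    # Broadcast each axis table over the full T x H x W grid, then concatenate.
    t = sincos_1d(T, d)[:, None, None, :].expand(T, H, W, d)
    h = sincos_1d(H, d)[None, :, None, :].expand(T, H, W, d)
    w = sincos_1d(W, d)[None, None, :, :].expand(T, H, W, d)
    pe = torch.cat([t, h, w], dim=-1).reshape(T * H * W, 3 * d)
    return pe[:, :D]  # truncate to D

pos = sincos_3d(8, 14, 14, 1024)  # 1568 x 1024, matching the token grid
```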

Weight initialization follows a truncated normal distribution with $\sigma = 0.02$, with a critical residual rescaling: the attention output projection and MLP second-layer weights in block $l$ are rescaled by $1/\sqrt{2l}$, improving training stability for deep networks.

During training, the encoder receives only the unmasked context tokens (not the full token sequence). This is handled by MultiMaskWrapper, which calls apply_masks(x, masks) to gather only the visible token indices via torch.gather before feeding them through the transformer blocks. This yields significant compute savings: processing ~10% of tokens rather than 100%.
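The gather itself is simple; here is a hedged sketch of what apply_masks does with assumed shapes (the real implementation also repeats masks per mask generator):

```python
import torch

B, N, D = 2, 1568, 1024
x = torch.randn(B, N, D)                         # full token sequence
ctx_idx = torch.randperm(N)[:157]                # ~10% visible indices (illustrative)
index = ctx_idx.view(1, -1, 1).expand(B, -1, D)  # B x 157 x D gather index
x_ctx = torch.gather(x, dim=1, index=index)      # B x 157 x D: only visible tokens
```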

WHY: The no-[CLS]-token design is a deliberate choice inherited from I-JEPA. Since V-JEPA's objective is patch-level prediction (predicting representations of specific spatial locations), a global [CLS] token would add no benefit — all information must be spatially resolved. The aggressive residual rescaling ($1/\sqrt{2l}$) is necessary for stable training at depth 32 (ViT-H) with the large effective batch sizes used (3072).

4.2 Target Encoder (EMA)

WHAT: The target encoder is an identical copy of the context encoder whose parameters are updated via exponential moving average (EMA) of the online encoder's parameters, with a complete stop-gradient. It produces the regression targets for the predictor.

HOW: The target encoder is initialized as copy.deepcopy(encoder) with all parameters set to requires_grad=False. After each optimizer step, it is updated as:

$$\bar{\theta}_t \leftarrow m_t \cdot \bar{\theta}_{t-1} + (1 - m_t) \cdot \theta_t$$

where the momentum $m_t$ follows a linear schedule from $m_0 = 0.998$ to $m_T = 1.0$:

$$m_t = m_0 + \frac{t}{T}(m_T - m_0)$$

with $T = \text{ipe} \times \text{epochs} \times \text{ipe\_scale} = 300 \times 300 \times 1.25 = 112{,}500$ total steps. The target encoder processes the complete token sequence (all 1568 tokens, no masking), and its output is layer-normalized before being used as regression targets: $s_i = \text{LayerNorm}(\bar{f}_{\bar{\theta}}(x)_i)$.
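The schedule and update above can be sketched directly from the two formulas (function names are illustrative):

```python
import torch

def ema_momentum(t, T=112_500, m0=0.998, mT=1.0):
    """Linear momentum schedule: m_t = m0 + (t / T) * (mT - m0)."""
    return m0 + (t / T) * (mT - m0)

@torch.no_grad()
def ema_update(target, online, m):
    """theta_bar <- m * theta_bar + (1 - m) * theta, applied per parameter."""
    for p_tgt, p_on in zip(target.parameters(), online.parameters()):
        p_tgt.mul_(m).add_(p_on, alpha=1.0 - m)

# usage after each optimizer step:
#   ema_update(target_encoder, encoder, ema_momentum(step))
```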

WHY: The EMA target provides a slowly-evolving, stable regression target that prevents the collapse mode where the encoder could trivially satisfy the prediction objective by producing constant representations. As $m_t$ increases toward 1.0, the target encoder becomes increasingly frozen, which stabilizes late-stage training. Layer normalization of target features is applied in the loss computation (not inside the target encoder architecture) and ensures that the L1 loss operates on scale-normalized features, preventing the trivial collapse solution of shrinking all representations toward zero. The choice of L1 over L2 loss (as used in I-JEPA) was found to provide better training stability for video.

4.3 Predictor

WHAT: The predictor is a narrow transformer that maps from context encoder outputs to predictions of target encoder outputs at masked positions. Its narrowness is a critical bottleneck that forces the encoder to learn rich, spatially-resolved features.

HOW: The predictor (VisionTransformerPredictor) has 12 transformer blocks at dimension $D_p = 384$ with 12 attention heads — substantially narrower than the encoder (1024 or 1280). It operates as follows:

  1. Context representations are linearly projected: $D \to D_p$ via predictor_embed
  2. Learnable mask tokens ($D_p$-dimensional, zero-initialized) are placed at target positions. V-JEPA uses 2 distinct mask tokens — one per mask generator — selected via mask_index in PredictorMultiMaskWrapper
  3. 3D sinusoidal positional embeddings (same configuration as encoder) are added to both context tokens and mask tokens
  4. The concatenated sequence $[\hat{h}_{\text{ctx}}; m_1, \ldots, m_M]$ passes through all 12 transformer blocks
  5. Only the positions corresponding to target tokens are extracted from the output
  6. A final linear projection maps $D_p \to D$ via predictor_proj

WHY: The narrow bottleneck ($D_p = 384$ vs. $D = 1280$) is essential for preventing a degenerate solution. If the predictor were as wide as the encoder, it could in principle memorize a lookup table mapping positional embeddings to representations, making the encoder's job trivial. The narrow predictor forces the encoder to produce context representations that are information-rich enough to support prediction despite the bottleneck. The use of 2 distinct mask tokens corresponds to the 2 mask generators (short-range and long-range), allowing the predictor to distinguish which masking pattern produced each target set.
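A simplified sketch of the six-step flow above, using nn.TransformerEncoder as a stand-in for the predictor's ViT blocks and omitting positional embeddings and the pre-projection layer norm (all names and the 157/1411 token split are illustrative):

```python
import torch
import torch.nn as nn

D, D_p, B = 1280, 384, 1
n_ctx, n_tgt = 157, 1411                           # visible vs. masked token counts

predictor_embed = nn.Linear(D, D_p)                # step 1: D -> D_p
mask_token = nn.Parameter(torch.zeros(1, 1, D_p))  # step 2: learnable, zero-init
blocks = nn.TransformerEncoder(                    # step 4: 12 narrow blocks
    nn.TransformerEncoderLayer(d_model=D_p, nhead=12, batch_first=True),
    num_layers=12,
)
predictor_proj = nn.Linear(D_p, D)                 # step 6: D_p -> D

ctx = torch.randn(B, n_ctx, D)                     # context encoder output
with torch.no_grad():                              # demo forward only
    h = predictor_embed(ctx)
    m = mask_token.expand(B, n_tgt, -1)            # (3D pos-embed addition omitted)
    z = blocks(torch.cat([h, m], dim=1))           # B x (157 + 1411) x 384
    pred = predictor_proj(z[:, n_ctx:])            # steps 5-6: B x 1411 x 1280
```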

4.4 Spatiotemporal Masking Strategy

WHAT: V-JEPA employs a multi-block 3D masking strategy that generates two types of target masks on the spatiotemporal token grid. The masks determine which tokens the predictor must predict and which tokens the encoder sees as context.

HOW: The masking is implemented in MaskCollator (from src/masks/multiblock3d.py), which operates on a 3D grid of shape $(T_d, H_g, W_g) = (8, 14, 14)$ for a 16-frame, 224×224 clip. Two mask generators are configured:

Short-range masks (Generator 1): 8 blocks, each covering exactly 15% of the spatial area ($14 \times 14 = 196$ spatial positions × 0.15 ≈ 29 positions per block per frame), spanning the full temporal extent ($T_d = 8$). Aspect ratio sampled uniformly from $[0.75, 1.5]$.

Long-range masks (Generator 2): 2 blocks, each covering 70% of the spatial area ($196 \times 0.70 ≈ 137$ positions per block per frame), spanning full temporal extent. Aspect ratio also from $[0.75, 1.5]$.

For each block, the generation algorithm:

  1. Samples temporal scale $s_t \sim \text{Uniform}(1.0, 1.0)$ → always the full temporal dimension
  2. Samples spatial scale $s_s \sim \text{Uniform}(s_{\min}, s_{\max})$ (e.g., [0.15, 0.15] or [0.7, 0.7])
  3. Samples aspect ratio $a \sim \text{Uniform}(0.75, 1.5)$
  4. Computes block dimensions: $h = \lfloor\sqrt{s_s \cdot H_g \cdot W_g \cdot a}\rfloor$, $w = \lfloor\sqrt{s_s \cdot H_g \cdot W_g / a}\rfloor$, $t = \lceil s_t \cdot T_d \rceil$
  5. Samples a random center position and clips to grid boundaries

The target mask ($\mathcal{T}$) consists of all token indices within any sampled block. The context mask ($\mathcal{C}$) is the complement: $\mathcal{C} = \{1, \ldots, N\} \setminus \mathcal{T}$. The resulting masking ratio is approximately 90%, leaving roughly 157 visible tokens out of 1568.
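Steps 1 to 5 for a single block reduce to a few lines; this sketch covers the spatial footprint only, since the temporal extent is always full in V-JEPA (grid sizes from the text; helper name illustrative):

```python
import math
import random

def sample_block(Hg=14, Wg=14, spatial_scale=0.15, aspect=(0.75, 1.5)):
    """Sample one block's spatial footprint on the H_g x W_g grid."""
    a = random.uniform(*aspect)
    n_spatial = spatial_scale * Hg * Wg         # target positions per frame
    h = min(int(math.sqrt(n_spatial * a)), Hg)  # block height
    w = min(int(math.sqrt(n_spatial / a)), Wg)  # block width
    r0 = random.randint(0, Hg - h)              # top-left corner, clipped to grid
    c0 = random.randint(0, Wg - w)
    return r0, c0, h, w

r0, c0, h, w = sample_block()
```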

Critically, the block sampling uses a shared seed across the batch (via multiprocessing.Value for thread-safe step counting), ensuring deterministic masking that is consistent across DataLoader workers but varies across training steps.

[Figure 2 diagram. Top: the same spatial mask is applied at every temporal position (t = 0, 2, 4, 6); short-range blocks (×8) cover 15% spatial area each and long-range blocks (×2) cover 70%, all spanning the full temporal extent, leaving ~10% of tokens as visible context.]

Masking strategy ablation (frozen eval, K400 / SSv2):

  • Random tube: 51.5% / 46.4%
  • Causal [6]: 61.3% / 49.8%
  • Causal [12]: 71.9% / 63.6%
  • Multi-block: 72.9% / 67.4%

Full-temporal-extent multi-block masking prevents trivial temporal interpolation and forces the encoder to learn spatially-resolved scene understanding.
Figure 2. V-JEPA spatiotemporal masking. Top: the same spatial mask pattern persists across all temporal positions — target blocks span the full clip duration. Bottom: ablation of masking strategies. Multi-block masking outperforms random tube and causal alternatives by large margins (Table 4 from the paper).

WHY: The full-temporal-extent design is the key architectural insight of V-JEPA. If masks spanned only a few frames, the predictor could reconstruct the missing region by copying from adjacent visible frames at the same spatial position — a strategy that requires no semantic understanding. By masking the same spatial region across all temporal positions, V-JEPA eliminates this shortcut and forces the model to predict what object or scene element occupies a given spatial location. The ablations confirm this decisively: random tube masking at 90% achieves only 51.5% on K400, while multi-block masking achieves 72.9% — a gap of 21.4 points.

4.5 Loss Function

WHAT: V-JEPA minimizes the L1 distance between predicted representations and layer-normalized target representations in the latent space of the target encoder.

HOW: Given a batch of $B$ video clips, let $\mathcal{T}_k$ denote the set of target token indices for the $k$-th mask generator ($k \in \{1, 2\}$), with $|\mathcal{T}_k| = M_k$. For each mask generator $k$:

  • $\hat{s}^{(k)}_i \in \mathbb{R}^D$: the predictor's output at target position $i \in \mathcal{T}_k$, projected from $D_p$ back to $D$
  • $s^{(k)}_i \in \mathbb{R}^D$: the target encoder's output at position $i$, after layer normalization: $s^{(k)}_i = \text{LayerNorm}(\bar{f}_{\bar{\theta}}(x)_i)$

The JEPA loss is:

$$\mathcal{L}_{\text{JEPA}} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{M_k} \sum_{i \in \mathcal{T}_k} \left\| \hat{s}^{(k)}_i - \text{sg}(s^{(k)}_i) \right\|_1$$

where $K = 2$ is the number of mask generators, $\text{sg}(\cdot)$ denotes stop-gradient, and $\| \cdot \|_1$ denotes the element-wise L1 norm (mean of absolute values across the $D$ dimensions). In code, this is implemented as:

import torch

def loss_fn(z, h):
    """z: list of predicted reps, h: list of target reps (layer-normed)."""
    loss = 0.
    for zi, hi in zip(z, h):
        loss += torch.mean(torch.abs(zi - hi))  # loss_exp=1.0 → L1
    loss /= len(z)  # Average over K mask generators
    return loss

The codebase also implements a variance regularization term:

$$\mathcal{L}_{\text{reg}} = \frac{1}{K}\sum_{k=1}^{K} \text{mean}\left(\text{ReLU}\left(1 - \sqrt{\text{Var}_{i \in \mathcal{T}_k}(\hat{s}^{(k)}_i) + \epsilon}\right)\right)$$

where $\text{Var}_{i \in \mathcal{T}_k}$ computes the variance across the spatial/temporal (patch) dimension and $\epsilon = 10^{-4}$. This encourages the per-dimension standard deviation of predictions to be at least 1.0, preventing variance collapse. However, in all released configurations, reg_coeff = 0.0, meaning this regularization is disabled in practice — the EMA + stop-gradient + narrow predictor combination is sufficient to prevent collapse without explicit regularization.
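Though disabled in all released configurations, the regularizer is straightforward to sketch from the formula above (assumed tensor layout B × M_k × D; function name illustrative):

```python
import torch
import torch.nn.functional as F

def variance_reg(preds, eps=1e-4):
    """Hinge loss pushing per-dimension std of predictions (across patches) above 1."""
    loss = 0.0
    for z in preds:                           # each z: B x M_k x D
        std = torch.sqrt(z.var(dim=1) + eps)  # std over the patch dimension -> B x D
        loss = loss + torch.mean(F.relu(1.0 - std))
    return loss / len(preds)
```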

The total loss is therefore simply:

$$\mathcal{L} = \mathcal{L}_{\text{JEPA}}$$

WHY: V-JEPA uses L1 loss rather than I-JEPA's L2 loss. L1 is more robust to outliers and distributional shifts in the target representations, which are more likely in video (where consecutive frames can have rapid appearance changes) than in static images. The layer normalization applied to targets normalizes the scale of each target vector to have zero mean and unit variance across the $D$ dimension, ensuring that the L1 loss treats all dimensions equally and preventing the model from trivially reducing loss by shrinking representation magnitudes. The fact that variance regularization is unnecessary ($\text{reg\_coeff} = 0.0$) is notable — it empirically demonstrates that V-JEPA's architectural design (EMA target, narrow predictor, aggressive masking) sufficiently prevents collapse without any explicit regularization loss term.

5. Implementation Details

Pretraining Hyperparameters

Parameter: ViT-L/16 (224) / ViT-H/16 (224) / ViT-H/16 (384)
Encoder layers: 24 / 32 / 32
Encoder embed dim: 1024 / 1280 / 1280
Encoder heads: 16 / 16 / 16
Encoder MLP ratio: 4.0 / 4.0 / 4.0
Patch size: 16 × 16 (all)
Tubelet size: 2 (all)
Input frames: 16 (frame_step=4) (all)
Crop size: 224 / 224 / 384
Num tokens (N): 8×14×14 = 1568 / 8×14×14 = 1568 / 8×24×24 = 4608
Predictor layers: 12 (all)
Predictor embed dim: 384 (all)
Predictor heads: 12 (all)
Num mask tokens: 2 (all)
Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8) (all)
Peak LR: 6.25×10⁻⁴ (all)
Start LR: 2.0×10⁻⁴ (all)
Final LR: 1.0×10⁻⁶ (all)
LR schedule: linear warmup → cosine decay (all)
Warmup epochs: 40 (all)
Weight decay (init → final): 0.04 → 0.4 (cosine) (all)
Gradient clip: 10.0 (all)
Batch size (per GPU): 24 / 24 / 10
GPUs: 128 (16 nodes × 8) / 128 (16 nodes × 8) / 240 (30 nodes × 8)
Effective batch size: 3072 / 3072 / 2400
Epochs: 300 (all)
Iterations per epoch: 300 (all)
ipe_scale: 1.25 (all)
Total iterations: ~90,000 (all)
Total samples processed: ~270M / ~270M / ~216M
EMA schedule: 0.998 → 1.0 (linear) (all)
Loss exponent: 1.0 (L1) (all)
Reg coeff: 0.0 (disabled) (all)
Mixed precision: bfloat16 (all)
Positional embed: 3D sinusoidal (uniform_power) (all)
SDPA: True (all)

Training Data: VideoMix2M

V-JEPA trains on VideoMix2M, a collection of approximately 2 million publicly available videos compiled from three sources:

Kinetics-400/600/700 (merged as K710): ~700K videos (action recognition clips)
Something-Something-v2: ~220K videos (object manipulation / motion)
HowTo100M: ~1.1M videos (instructional videos; no text used)

Overlapping videos with downstream evaluation validation/test sets are removed. Data augmentation is minimal: random resized crop (scale $[0.3, 1.0]$, aspect ratio $[0.75, 1.35]$) and ImageNet normalization (mean=$(0.485, 0.456, 0.406)$, std=$(0.229, 0.224, 0.225)$). No auto-augmentation, no random erasing, no motion shift.
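A torch-only sketch of this minimal pipeline, applying one shared random resized crop to all frames of a clip (the function name is illustrative; torchvision's RandomResizedCrop samples the crop slightly differently):

```python
import math
import random
import torch
import torch.nn.functional as F

def random_resized_crop(clip, size=224, scale=(0.3, 1.0), ratio=(0.75, 1.35)):
    """One shared random resized crop for all frames. clip: F x C x H x W."""
    _, _, H, W = clip.shape
    s = random.uniform(*scale) * H * W            # target crop area
    a = random.uniform(*ratio)                    # aspect ratio w/h
    h = min(int(round(math.sqrt(s / a))), H)
    w = min(int(round(math.sqrt(s * a))), W)
    top = random.randint(0, H - h)
    left = random.randint(0, W - w)
    crop = clip[:, :, top:top + h, left:left + w]
    return F.interpolate(crop, size=(size, size), mode="bilinear", align_corners=False)

mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
clip = torch.rand(16, 3, 256, 320)                # F x C x H x W
out = (random_resized_crop(clip) - mean) / std    # 16 x 3 x 224 x 224, normalized
```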

Key Class Names (from repository)

Encoder: VisionTransformer (src/models/vision_transformer.py)
3D Patch Embed: PatchEmbed3D (src/models/utils/patch_embed.py)
Predictor: VisionTransformerPredictor (src/models/predictor.py)
Encoder Wrapper: MultiMaskWrapper (src/models/utils/multimask.py)
Predictor Wrapper: PredictorMultiMaskWrapper (src/models/utils/multimask.py)
3D Mask Collator: MaskCollator (src/masks/multiblock3d.py)
Attentive Probe: AttentiveClassifier (src/models/attentive_pooler.py)
LR Schedule: WarmupCosineSchedule (src/utils/schedulers.py)
WD Schedule: CosineWDSchedule (src/utils/schedulers.py)
Training Loop: procedural (app/vjepa/train.py)

6. Algorithm

Algorithm 1: V-JEPA Training (One Iteration)
Input: Video dataset $\mathcal{D}$; context encoder $f_\theta$ (ViT); target encoder $f_{\bar{\theta}}$ (EMA copy); predictor $g_\phi$; EMA momentum schedule $m(t)$; LR schedule $\eta(t)$; WD schedule $\lambda(t)$; 3D mask collator $\mathcal{M}$ with $K=2$ generators
Output: Updated parameters $\theta$, $\phi$, $\bar{\theta}$
1 for each mini-batch $\{v_1, \ldots, v_B\} \sim \mathcal{D}$ do
2 // Data loading and augmentation
3 for each video $v_b$ do
4 Sample 16 frames with frame_step=4 via Decord
5 Apply random resized crop (scale $[0.3, 1.0]$, ratio $[0.75, 1.35]$) → $x_b \in \mathbb{R}^{16 \times 224 \times 224 \times 3}$
6 Normalize with ImageNet statistics
7 end for
8 // Generate masks via Algorithm 2
9 for $k = 1, \ldots, K$ do
10 $(\mathcal{C}_k, \mathcal{T}_k) \leftarrow \mathcal{M}_k.\text{sample}(T_d{=}8, H_g{=}14, W_g{=}14)$   // context and target indices
11 end for
12 $\mathcal{C} \leftarrow \mathcal{C}_1 \cap \mathcal{C}_2$   // context = intersection of complements of all target masks
13 // Tubelet embedding (shared for both encoder paths)
14 $X = \text{PatchEmbed3D}(\{x_b\}) + \text{pos\_embed}_{3D}$   // B×1568×D
15 // Target encoder forward (no gradient)
16 with torch.no_grad():
17 $H_{\text{tgt}} = f_{\bar{\theta}}(X)$   // Full forward, all 1568 tokens → B×1568×D
18 $H_{\text{tgt}} \leftarrow \text{LayerNorm}(H_{\text{tgt}})$   // Normalize target features
19 for $k = 1, \ldots, K$ do
20 $h^{(k)}_{\text{tgt}} = \text{gather}(H_{\text{tgt}}, \mathcal{T}_k)$   // B×M_k×D
21 end for
22 // Context encoder forward (with gradient)
23 $X_{\text{ctx}} = \text{gather}(X, \mathcal{C})$   // B×|C|×D (only visible tokens)
24 $z_{\text{ctx}} = f_\theta(X_{\text{ctx}})$   // B×|C|×D
25 // Predictor forward (with gradient), per mask generator
26 $\mathcal{L} \leftarrow 0$
27 for $k = 1, \ldots, K$ do
28 $\hat{z}_{\text{ctx}} = \text{Linear}_{D \to D_p}(z_{\text{ctx}})$   // Project to predictor dim → B×|C|×384
29 Create mask tokens $\{m^{(k)}_i\}_{i \in \mathcal{T}_k}$ using learnable token $k$, add 3D pos embed
30 $\hat{s}^{(k)} = g_\phi([\hat{z}_{\text{ctx}}; m^{(k)}_1, \ldots, m^{(k)}_{M_k}])[:, |\mathcal{C}|:]$   // Extract target positions → B×M_k×D_p
31 $\hat{s}^{(k)} = \text{Linear}_{D_p \to D}(\text{LayerNorm}(\hat{s}^{(k)}))$   // Project back → B×M_k×D
32 $\mathcal{L} \leftarrow \mathcal{L} + \frac{1}{M_k}\sum_{i=1}^{M_k} \| \hat{s}^{(k)}_i - \text{sg}(h^{(k)}_{\text{tgt},i}) \|_1$   // L1 loss
33 end for
34 $\mathcal{L} \leftarrow \mathcal{L} / K$   // Average over mask generators
35 // Backward and update
36 Compute $\nabla_\theta \mathcal{L}$, $\nabla_\phi \mathcal{L}$   // Gradients w.r.t. encoder and predictor
37 Clip gradients: $\|\nabla\| \leq 10.0$
38 AdamW update: $\theta \leftarrow \theta - \eta(t) \cdot \text{Adam}(\nabla_\theta \mathcal{L})$ with $\lambda(t)$ weight decay
39 AdamW update: $\phi \leftarrow \phi - \eta(t) \cdot \text{Adam}(\nabla_\phi \mathcal{L})$ with $\lambda(t)$ weight decay
40 // EMA update for target encoder
41 $\bar{\theta} \leftarrow m(t) \cdot \bar{\theta} + (1 - m(t)) \cdot \theta$
42 $t \leftarrow t + 1$
43 end for
Algorithm 2: 3D Multi-block Mask Sampling
Input: 3D token grid $(T_d, H_g, W_g)$ (e.g., $8 \times 14 \times 14$); mask generator config: num_blocks $n_b$, spatial_scale $[s_{\min}, s_{\max}]$, temporal_scale $[t_{\min}, t_{\max}]$, aspect_ratio $[a_{\min}, a_{\max}]$
Output: Target token indices $\mathcal{T}$, context token indices $\mathcal{C}$
1 $N \leftarrow T_d \times H_g \times W_g$   // Total tokens (1568)
2 $\text{mask} \leftarrow \mathbf{0}_{T_d \times H_g \times W_g}$   // Binary 3D mask, 0=context, 1=target
3 for $b = 1, \ldots, n_b$ do
4 Sample $s \sim \text{Uniform}(s_{\min}, s_{\max})$   // Spatial scale (fraction of H_g×W_g)
5 Sample $s_t \sim \text{Uniform}(t_{\min}, t_{\max})$   // Temporal scale (fraction of T_d)
6 Sample $a \sim \text{Uniform}(a_{\min}, a_{\max})$   // Aspect ratio
7 $n_{\text{spatial}} \leftarrow \lfloor s \cdot H_g \cdot W_g \rfloor$   // Number of spatial positions per frame
8 $h_b \leftarrow \min(\lfloor \sqrt{n_{\text{spatial}} \cdot a} \rfloor, H_g)$
9 $w_b \leftarrow \min(\lfloor \sqrt{n_{\text{spatial}} / a} \rfloor, W_g)$
10 $t_b \leftarrow \min(\lceil s_t \cdot T_d \rceil, T_d)$   // In V-JEPA: always T_d (full temporal extent)
11 Sample top-left corner $(r_0, c_0, t_0)$: $r_0 \sim [0, H_g - h_b]$, $c_0 \sim [0, W_g - w_b]$, $t_0 \sim [0, T_d - t_b]$
12 $\text{mask}[t_0:t_0{+}t_b,\; r_0:r_0{+}h_b,\; c_0:c_0{+}w_b] \leftarrow 1$
13 end for
14 $\mathcal{T} \leftarrow \{i : \text{mask.flatten}()[i] = 1\}$   // Target indices
15 $\mathcal{C} \leftarrow \{i : \text{mask.flatten}()[i] = 0\}$   // Context indices
16 Truncate $|\mathcal{C}|$ across batch to $\min_b |\mathcal{C}_b|$ for tensor batching
17 return $\mathcal{T}$, $\mathcal{C}$
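Algorithm 2 can be sketched in plain Python. `sample_3d_masks` is a hypothetical helper; the grid size and the 15%-spatial short-range configuration come from the text, everything else is illustrative:

```python
import math
import random

def sample_3d_masks(T_d=8, H_g=14, W_g=14, n_blocks=8,
                    spatial_scale=(0.15, 0.15), aspect_ratio=(0.75, 1.5),
                    rng=random):
    """Sample a union-of-blocks target mask over a (T_d, H_g, W_g) token grid.

    Following V-JEPA's full-temporal-extent design, every block spans all
    T_d temporal positions; only the spatial footprint varies per block.
    """
    mask = [[[0] * W_g for _ in range(H_g)] for _ in range(T_d)]
    for _ in range(n_blocks):
        s = rng.uniform(*spatial_scale)            # fraction of H_g * W_g
        a = rng.uniform(*aspect_ratio)             # block aspect ratio h/w
        n_spatial = int(s * H_g * W_g)
        h_b = min(int(math.sqrt(n_spatial * a)), H_g)
        w_b = min(int(math.sqrt(n_spatial / a)), W_g)
        r0 = rng.randint(0, H_g - h_b)             # top-left corner (spatial)
        c0 = rng.randint(0, W_g - w_b)
        for t in range(T_d):                       # full temporal extent
            for r in range(r0, r0 + h_b):
                for c in range(c0, c0 + w_b):
                    mask[t][r][c] = 1
    flat = [mask[t][r][c] for t in range(T_d)
            for r in range(H_g) for c in range(W_g)]
    targets = [i for i, m in enumerate(flat) if m == 1]
    context = [i for i, m in enumerate(flat) if m == 0]
    return targets, context

targets, context = sample_3d_masks()
assert len(targets) + len(context) == 8 * 14 * 14   # partition of 1568 tokens
```

Because every block covers all eight temporal positions, each masked spatial position is masked in every frame — the property that blocks trivial temporal interpolation.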
Algorithm 3: V-JEPA Inference (Frozen Feature Extraction + Attentive Probing)
Input: Video $v$; trained target encoder $f_{\bar{\theta}}$; trained attentive probe $(P_{\text{attn}}, W_{\text{cls}})$; number of temporal segments $S$; spatial crops $C$
Output: Class prediction $\hat{y}$
1 // Multi-view evaluation: $S$ temporal clips × $C$ spatial crops
2 for $s = 1, \ldots, S$; $c = 1, \ldots, C$ do
3 Extract temporal segment $s$ from $v$: sample 16 frames with frame_step=4
4 Apply spatial crop $c$ (left/center/right for 3 crops) → $x_{s,c} \in \mathbb{R}^{16 \times 224 \times 224 \times 3}$
5 Normalize with ImageNet statistics
6 // Frozen encoder forward (no masking)
7 $X_{s,c} = \text{PatchEmbed3D}(x_{s,c}) + \text{pos\_embed}_{3D}$   // 1568×D
8 $H_{s,c} = f_{\bar{\theta}}(X_{s,c})$   // 1568×D (all tokens, no masking)
9 // Attentive probing
10 $q = \text{learnable\_query} \in \mathbb{R}^{1 \times D}$   // Single learnable query token
11 $q' = \text{CrossAttention}(q, H_{s,c})$   // Query attends to frozen features → 1×D
12 $q'' = \text{SelfAttentionBlock}(q')$   // 1 layer of self-attention → 1×D
13 $\text{logits}_{s,c} = W_{\text{cls}} \cdot q'' \in \mathbb{R}^{N_{\text{classes}}}$   // Linear classifier
14 end for
15 $\hat{y} = \arg\max \left( \frac{1}{S \cdot C} \sum_{s,c} \text{softmax}(\text{logits}_{s,c}) \right)$   // Average probabilities across views
16 return $\hat{y}$

7. Training

Step-by-step: One Training Iteration

The following describes exactly what happens during a single training step of V-JEPA, with concrete dimensions for ViT-H/16 at 224 resolution:

Step 1 — Data loading. A mini-batch of $B = 3072$ videos is loaded (distributed across 128 GPUs, 24 per GPU). For each video, 16 frames are sampled at frame_step=4 (covering ~2.1 seconds at 30fps). Random resized crop and normalization yield tensors of shape $(B, 3, 16, 224, 224)$.

Step 2 — Tubelet embedding. PatchEmbed3D (a Conv3d with kernel $(2, 16, 16)$) embeds each clip into $N = 8 \times 14 \times 14 = 1568$ tokens of dimension $D = 1280$. 3D sinusoidal positional embeddings are added. Shape: $(B, 1568, 1280)$.

Step 3 — Mask generation. Two mask generators produce target indices. Generator 1 (short-range): 8 blocks at 15% spatial scale, full temporal extent → $8 \times 29 \times 8 = 1856$ token-slots (blocks × spatial positions × frames), with overlap yielding roughly 1000–1200 unique target tokens. Generator 2 (long-range): 2 blocks at 70% spatial scale, full temporal extent → $2 \times 137 \times 8 = 2192$ token-slots. The union of all targets leaves approximately $|\mathcal{C}| \approx 157$ visible context tokens, about 10% of the 1568 total.
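A quick sanity check of the token-slot arithmetic above (values taken directly from the text):

```python
# Token grid: 8 temporal x 14 x 14 spatial = 1568 tokens
T_d, H_g, W_g = 8, 14, 14

# Generator 1 (short-range): 8 blocks, 15% spatial scale, full temporal extent
spatial_1 = int(0.15 * H_g * W_g)        # 29 spatial positions per block
slots_1 = 8 * spatial_1 * T_d            # 1856 token-slots (before overlap)

# Generator 2 (long-range): 2 blocks, 70% spatial scale, full temporal extent
spatial_2 = int(0.70 * H_g * W_g)        # 137 spatial positions per block
slots_2 = 2 * spatial_2 * T_d            # 2192 token-slots (before overlap)

print(slots_1, slots_2)                  # 1856 2192
# Overlap between blocks means unique targets < slots; the union leaves
# roughly 10% of the 1568 tokens (~157) visible to the context encoder.
```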

Step 4 — Target encoder forward (no gradient). The full token sequence $(B, 1568, 1280)$ is fed through the target encoder (32 transformer blocks). Output is layer-normalized. Target representations are gathered at positions $\mathcal{T}_1$ and $\mathcal{T}_2$ via torch.gather. This produces two target tensors: $h^{(1)}_{\text{tgt}} \in \mathbb{R}^{B \times M_1 \times 1280}$ and $h^{(2)}_{\text{tgt}} \in \mathbb{R}^{B \times M_2 \times 1280}$.

Step 5 — Context encoder forward (with gradient). Only context tokens are gathered: $X_{\text{ctx}} \in \mathbb{R}^{B \times |\mathcal{C}| \times 1280}$, where $|\mathcal{C}| \approx 157$. This passes through the online encoder (32 blocks), producing $z_{\text{ctx}} \in \mathbb{R}^{B \times 157 \times 1280}$. Processing 157 tokens instead of 1568 yields roughly a 10× reduction in encoder FLOPs.
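The ~10× figure can be checked with a back-of-envelope FLOPs model for one ViT block — the $4ND^2$ projection, $8ND^2$ MLP, and $2N^2D$ attention-map terms are the standard rough counts, not profiler numbers:

```python
def vit_layer_flops(n_tokens, d=1280):
    """Rough FLOPs for one transformer block at width d (ViT-H: d=1280)."""
    proj = 4 * n_tokens * d * d           # q, k, v, and output projections
    attn = 2 * n_tokens * n_tokens * d    # QK^T and attn @ V
    mlp = 8 * n_tokens * d * d            # two linear layers, 4x expansion
    return proj + attn + mlp

full = vit_layer_flops(1568)   # all tokens (target encoder)
ctx = vit_layer_flops(157)     # context tokens only (online encoder)
print(round(full / ctx, 1))    # ~12x: the linear terms dominate at D=1280
```

The quadratic attention term shrinks ~100×, but at $D = 1280$ the linear projection and MLP terms dominate, so the overall saving lands near the ~10× quoted above.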

Step 6 — Predictor forward (with gradient). For each mask generator $k \in \{1, 2\}$: (a) Context representations are projected $1280 \to 384$. (b) $M_k$ learnable mask tokens (zero-initialized, $D_p = 384$) with 3D positional embeddings are concatenated. (c) The sequence $(|\mathcal{C}| + M_k, 384)$ passes through 12 predictor blocks. (d) Only target-position outputs are extracted and projected $384 \to 1280$. Result: $\hat{s}^{(k)} \in \mathbb{R}^{B \times M_k \times 1280}$.

Step 7 — Loss computation. L1 loss between $\hat{s}^{(k)}$ and $\text{sg}(h^{(k)}_{\text{tgt}})$, averaged over tokens and mask generators.
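Step 7 in a minimal numpy sketch — numpy has no autograd, so the stop-gradient $\text{sg}(\cdot)$ is implicit here; in a real implementation it corresponds to detaching the target tensors:

```python
import numpy as np

def vjepa_loss(pred_list, target_list):
    """Mean L1 distance between predictions and (detached) targets,
    averaged over batch, tokens, feature dims, and mask generators."""
    per_generator = [np.abs(p - t).mean()            # L1, not L2
                     for p, t in zip(pred_list, target_list)]
    return float(np.mean(per_generator))

rng = np.random.default_rng(0)
B, M1, M2, D = 4, 1100, 1400, 1280   # M_k = targets per mask generator
preds = [rng.normal(size=(B, M1, D)), rng.normal(size=(B, M2, D))]
tgts = [rng.normal(size=(B, M1, D)), rng.normal(size=(B, M2, D))]
loss = vjepa_loss(preds, tgts)
assert loss > 0.0
```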

Step 8 — Backward pass. Gradients are computed with respect to $\theta$ (encoder) and $\phi$ (predictor). Gradients are clipped at norm 10.0.

Step 9 — Optimizer step. AdamW updates both parameter groups. Learning rate follows cosine schedule from $6.25 \times 10^{-4}$ (peak, after 40 warmup epochs) to $10^{-6}$ (final). Weight decay follows cosine schedule from 0.04 to 0.4.

Step 10 — EMA update. Target encoder parameters are updated: $\bar{\theta} \leftarrow m_t \cdot \bar{\theta} + (1 - m_t) \cdot \theta$, where $m_t$ linearly increases from 0.998 to 1.0 over 112,500 total steps.
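Steps 9–10 use three scalar schedules. A sketch of all three — the 2e-4 starting LR and the warmup length in steps are assumptions; the text fixes the peak/final LR, the weight-decay range, and the EMA endpoints:

```python
import math

def lr_schedule(t, warmup_steps, total_steps,
                start=2e-4, peak=6.25e-4, final=1e-6):
    """Linear warmup to the peak LR, then cosine decay to the final LR."""
    if t < warmup_steps:
        return start + (peak - start) * t / warmup_steps
    p = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return final + 0.5 * (peak - final) * (1 + math.cos(math.pi * p))

def wd_schedule(t, total_steps, start=0.04, end=0.4):
    """Cosine increase of weight decay from 0.04 to 0.4."""
    p = t / total_steps
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * p))

def ema_momentum(t, total_steps=112_500, start=0.998, end=1.0):
    """Linear momentum ramp for the target-encoder EMA update."""
    return start + (end - start) * min(t, total_steps) / total_steps

assert abs(lr_schedule(0, 1000, 90_000) - 2e-4) < 1e-9      # warmup start
assert abs(lr_schedule(1000, 1000, 90_000) - 6.25e-4) < 1e-9  # peak
assert abs(ema_momentum(112_500) - 1.0) < 1e-12             # final momentum
```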

Training Iteration Diagram

[Figure 3 diagram: the ten steps above laid out as a dataflow, with gradient flow reaching only the context encoder $f_\theta$ and predictor $g_\phi$; the target encoder $f_{\bar{\theta}}$ is dashed (EMA, no gradients). Schedule summary: LR 2e-4 → 6.25e-4 → 1e-6 (40-epoch warmup, then cosine); WD 0.04 → 0.4 (cosine); EMA momentum 0.998 → 1.0 (linear); ~90K iterations, ~270M samples, 128 GPUs (16 nodes × 8).]
Figure 3. Complete V-JEPA training iteration. Steps are numbered in execution order. Gradient flows (dashed green lines) reach only the context encoder and predictor. The target encoder is updated solely via EMA (Step 10). Processing only ~10% of tokens through the encoder (Step 5) provides significant compute savings.

8. Inference

V-JEPA's primary evaluation protocol uses a completely frozen encoder with an attentive probing mechanism — a lightweight cross-attention pooler trained on top of frozen features. This protocol is a strict test of feature quality: the encoder cannot adapt to compensate for poor representations.

Attentive Probing (AttentiveClassifier)

The probe consists of an AttentivePooler followed by a linear classifier:

  1. A single learnable query token $q \in \mathbb{R}^{1 \times D}$ attends to frozen encoder features $H \in \mathbb{R}^{N \times D}$ via cross-attention (with $D = 1280$ for ViT-H)
  2. A single self-attention block further refines the query (depth=1, mlp_ratio=4.0)
  3. A linear head projects from $D$ to the number of classes

The attentive probe is trained for 20 epochs with AdamW (lr=0.001, wd=0.01) on the downstream dataset while the encoder remains completely frozen. This adds minimal parameters (~5M) compared to the encoder's hundreds of millions.
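The probe's forward pass can be sketched in numpy — single-head, with the depth-1 self-attention block omitted for brevity and toy dimensions; the weight names are illustrative, not the actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_probe(H, q, W_q, W_k, W_v, W_cls):
    """Pool frozen features H (N x D) with a single learnable query via
    cross-attention, then classify. Single-head sketch; the depth-1
    self-attention refinement block from the text is omitted."""
    D = H.shape[1]
    keys, values = H @ W_k, H @ W_v                  # N x D each
    attn = softmax((q @ W_q) @ keys.T / np.sqrt(D))  # 1 x N attention weights
    pooled = attn @ values                           # 1 x D pooled feature
    return (pooled @ W_cls).ravel()                  # (n_classes,) logits

# Toy sizes (real model: N=1568 tokens, D=1280 for ViT-H)
rng = np.random.default_rng(0)
N, D, C = 1568, 128, 10
H = rng.normal(size=(N, D))
q = rng.normal(size=(1, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) * D**-0.5 for _ in range(3))
logits = attentive_probe(H, q, W_q, W_k, W_v, rng.normal(size=(D, C)) * D**-0.5)
assert logits.shape == (C,)
```

During probe training, only `q`, the attention weights, and `W_cls` receive gradients; `H` comes from the frozen encoder.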

The improvement over simple average pooling is dramatic: +17.3 points on K400 and +16.1 points on SSv2. This indicates that V-JEPA's representations are rich in spatially-distributed information that average pooling destroys.

Multi-view Inference

For video classification, V-JEPA uses multi-view testing with notation $F \times S \times C$ meaning $F$ frames per clip, $S$ temporal segments, $C$ spatial crops:

| Benchmark | Multi-view ($F \times S \times C$) | Segments | Crops | attend_across_segments |
|---|---|---|---|---|
| Kinetics-400 | 16×8×3 | 8 | 3 (left/center/right) | True |
| Something-Something-v2 | 16×2×3 | 2 | 3 | True |
| ImageNet-1K | single view | 1 | 1 (center crop) | N/A |

With attend_across_segments=True, the attentive pooler's query can attend to features from all temporal segments simultaneously, enabling temporal reasoning across the full video duration.
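The view-averaging rule at the end of Algorithm 3, sketched in numpy (the 24-view count follows the K400 protocol in the table above; the logits here are random stand-ins):

```python
import numpy as np

def multi_view_predict(logits_per_view):
    """Average softmax probabilities across all S x C views, then argmax."""
    logits = np.stack(logits_per_view)               # (views, n_classes)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return int(probs.mean(axis=0).argmax())

# K400 protocol: 8 temporal segments x 3 spatial crops = 24 views
rng = np.random.default_rng(0)
views = [rng.normal(size=400) for _ in range(24)]
assert 0 <= multi_view_predict(views) < 400
```

Averaging probabilities (rather than logits) after softmax matches the $\hat{y}$ rule in Algorithm 3, line 15.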

Image Inference

For image tasks (ImageNet-1K, Places205, iNaturalist2021), a single image is treated as a 16-frame "video" by repeating the frame. The tubelet embedding (size 2) processes pairs of identical frames, and 3D positional embeddings are applied normally. The attentive probe attends to all $8 \times 14 \times 14 = 1568$ resulting tokens.
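The repeat-frame trick is a one-liner; a sketch with numpy:

```python
import numpy as np

def image_to_clip(img, n_frames=16):
    """Repeat a single image along a new time axis so the video encoder's
    tubelet embedding (temporal kernel 2) sees pairs of identical frames."""
    assert img.ndim == 3                             # H x W x C
    return np.repeat(img[None], n_frames, axis=0)    # T x H x W x C

img = np.zeros((224, 224, 3), dtype=np.float32)
clip = image_to_clip(img)
assert clip.shape == (16, 224, 224, 3)
# Tubelet grid: (16/2) x (224/16) x (224/16) = 8 x 14 x 14 = 1568 tokens
```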

Inference Pipeline Diagram

[Figure 4 diagram: top, video path — multi-view sampling (K400: 8×3 = 24 views; SSv2: 2×3 = 6) → tubelet embed (1568×D per view) → frozen encoder $f_{\bar{\theta}}$ with no masking → attentive probe (trainable 1×D query + linear head) → probabilities averaged across views. Bottom, image path — a 224×224 image repeated ×16 into a clip and run through the same frozen video encoder and probe (IN1K: 77.9%).]
Figure 4. V-JEPA inference pipeline. Top: video classification with multi-view evaluation — the frozen encoder processes each view independently, attentive probing pools features, and softmax probabilities are averaged across views. Bottom: image classification — a single image is repeated 16 times and treated as a video through the same frozen encoder and probe.

9. Results & Benchmarks

Primary Frozen Evaluation (Attentive Probing)

All results below use a completely frozen encoder backbone. Only the lightweight attentive probe (~5M parameters) is trained on the downstream task.

| Model | K400 (top-1) | SSv2 (top-1) | IN1K (top-1) | Places205 | iNat2021 | AVA (mAP) |
|---|---|---|---|---|---|---|
| V-JEPA ViT-L/16 (224) | 80.8% | 69.5% | 74.8% | 60.3% | 67.8% | 25.6 |
| V-JEPA ViT-H/16 (224) | 82.0% | 71.4% | 75.9% | 61.7% | 67.9% | 25.8 |
| V-JEPA ViT-H/16 (384) | 81.9% | 72.2% | 77.9% | 62.8% | 72.6% | 25.0 |

Feature Prediction vs. Pixel Reconstruction (ViT-L Ablation)

| Objective | K400 | SSv2 | IN1K |
|---|---|---|---|
| Feature prediction (V-JEPA) | 73.7% | 66.2% | 74.8% |
| Pixel reconstruction | 68.6% | 66.0% | 73.3% |
| Δ (feature − pixel) | +5.1 | +0.2 | +1.5 |

Feature prediction shows its largest advantage on motion/appearance-balanced tasks (K400: +5.1 points). The SSv2 gap is small because SSv2 is motion-dominant and less affected by low-level texture noise. ImageNet shows a consistent +1.5 point advantage, indicating that feature prediction benefits image tasks as well.

Masking Strategy Ablation (ViT-L)

| Strategy | K400 | SSv2 |
|---|---|---|
| Random-tube [0.9] | 51.5% | 46.4% |
| Causal multi-block [6 frames] | 61.3% | 49.8% |
| Causal multi-block [12 frames] | 71.9% | 63.6% |
| Multi-block (V-JEPA) | 72.9% | 67.4% |

The multi-block strategy outperforms all alternatives. Random tube masking is catastrophically poor (51.5%), confirming that structured spatial masking across all temporal positions is essential. Causal masking (context from first $p$ frames) performs worse than bidirectional multi-block masking, especially on SSv2 where forward and backward temporal reasoning both matter.

Comparison with Prior Self-supervised Methods (Frozen Evaluation)

| Method | Arch | Samples | Iterations | K400 | SSv2 |
|---|---|---|---|---|---|
| VideoMAE | ViT-L | 410M | 400K | 77.8% | 65.5% |
| Hiera | Hiera-L | 770M | 1500K | — | — |
| V-JEPA | ViT-L | 270M | 90K | 80.8% | 69.5% |

V-JEPA achieves higher accuracy with fewer samples (270M vs. 410M) and far fewer training iterations (90K vs. 400K–1500K), representing roughly a 2× improvement in compute efficiency over VideoMAE and an order of magnitude fewer iterations than Hiera.

Comparison with Image-pretrained Models (Frozen)

| Method | Pretraining | K400 | SSv2 |
|---|---|---|---|
| DINOv2 ViT-g/14 | Image (LVD-142M) | 83.4% | 50.6% |
| OpenCLIP ViT-G/14 | Image+Text (LAION-2B) | 81.8% | 34.8% |
| VideoMAEv2 ViT-g/14 | Video (UnlabeledHybrid) | 71.2% | 61.2% |
| V-JEPA ViT-H/16 (384) | Video (VideoMix2M) | 81.9% | 72.2% |

DINOv2 edges out V-JEPA on K400 (83.4% vs. 81.9%) — expected given DINOv2's massive image pretraining data (142M images) and giant architecture. However, on the motion-centric SSv2 benchmark, V-JEPA dominates: 72.2% versus DINOv2's 50.6% and OpenCLIP's 34.8%. This 20+ point gap on SSv2 demonstrates that video pretraining captures temporal dynamics that image-only or image+text methods fundamentally cannot learn.

Label Efficiency (5% Labeled Data, Fine-tuning)

| Method | Arch | K400 (5%) | SSv2 (5%) |
|---|---|---|---|
| VideoMAE | ViT-H/16 | 62.3% | 41.4% |
| VideoMAEv2 | ViT-g/14 | 37.0% | 28.0% |
| V-JEPA | ViT-H/16 | 67.0% | 54.0% |

With only 5% labeled data, V-JEPA outperforms VideoMAE by 4.7 points on K400 and 12.6 points on SSv2. The SSv2 gap is especially large, indicating that V-JEPA's feature-prediction pretraining produces representations that are substantially more data-efficient for downstream temporal reasoning tasks.

Full Fine-tuning Results (ViT-L/16)

| Method | K400 | SSv2 |
|---|---|---|
| VideoMAE | 85.4% | 74.3% |
| Hiera | 87.3% | 75.1% |
| V-JEPA | 85.6% | 75.1% |

Under full fine-tuning, V-JEPA matches or exceeds VideoMAE and is competitive with Hiera. The gap between frozen and fine-tuned performance is much smaller for V-JEPA (~5 points on K400) than for pixel-reconstruction methods, indicating that V-JEPA features are already close to task-optimal without adaptation.

Attentive Probing vs. Average Pooling

| Pooling Method | K400 | SSv2 |
|---|---|---|
| Average pooling + linear | ~63.5% | ~53.4% |
| Attentive probing | 80.8% | 69.5% |
| Δ | +17.3 | +16.1 |

The +17 point improvement from attentive probing over average pooling indicates that V-JEPA's representations are highly spatially structured — information is distributed across token positions rather than being globally aggregated. Average pooling destroys this spatial structure, while the cross-attention query can selectively attend to the most task-relevant tokens.

10. Connection to JEPA Family

V-JEPA is the first video-native instantiation of the Joint-Embedding Predictive Architecture framework. Its lineage within the JEPA family is direct and well-defined:

Derives from I-JEPA (Assran et al., 2023): V-JEPA inherits I-JEPA's core architecture — online encoder, EMA target encoder, narrow predictor transformer, multi-block masking, and loss in representation space — and extends each component to the spatiotemporal domain. The conceptual framework is identical; the engineering differs in patch embedding (3D tubelets vs. 2D patches), positional embeddings (3D sinusoidal vs. 2D), masking (full-temporal-extent 3D blocks vs. 2D spatial blocks), and loss function (L1 vs. L2).

Conceptual lineage to JEPA (LeCun, 2022): V-JEPA embodies the key principles of the original JEPA position paper: prediction in learned representation space rather than input space, an energy-based formulation with a learned predictor, and a target encoder that provides stable regression targets. V-JEPA also inherits conceptual connections to BYOL-style EMA self-supervised learning (Grill et al., 2020) and the broader family of Siamese SSL methods, though it differs from these by using spatial prediction rather than augmentation-driven invariance.

Key Novelties of V-JEPA within the JEPA Family:
  1. Video-native feature prediction: V-JEPA is the first method to demonstrate that feature prediction (rather than pixel prediction) yields superior video representations, validating the JEPA principle beyond static images.
  2. Full-temporal-extent masking: The design of masking spatial regions across all frames — rather than random spatiotemporal cubes — is a critical insight that prevents trivial temporal interpolation shortcuts. This masking strategy is specific to V-JEPA and not present in I-JEPA's 2D masking.
  3. Frozen evaluation as the primary protocol: While I-JEPA reports both linear probing and fine-tuning, V-JEPA foregrounds frozen evaluation with attentive probing as the primary measure of representation quality. This raises the bar for what constitutes a good self-supervised video model.
  4. No language supervision: Unlike many video SSL methods that leverage text (InternVideo, VideoCLIP), V-JEPA demonstrates that purely visual self-supervised learning from unlabeled video — with no text, no labels, no pretrained encoders — can match or exceed language-supervised approaches on motion-centric benchmarks.
  5. Training efficiency: V-JEPA processes only ~270M video samples in 90K iterations — roughly 1.5× fewer samples and more than 4× fewer iterations than VideoMAE — demonstrating that operating in feature space is not just qualitatively better but also computationally cheaper.

Influence on subsequent work: V-JEPA establishes the viability of the JEPA paradigm for video understanding and opens the path to further temporal extensions. The related MC-JEPA work applies JEPA ideas to jointly learning content features and motion (optical flow). V-JEPA's frozen evaluation protocol and attentive probing mechanism have also been adopted by subsequent video representation learning work as a more rigorous evaluation standard. The V-JEPA codebase (hosted at github.com/facebookresearch/jepa) serves as the reference implementation for the JEPA family, with I-JEPA, V-JEPA, and related evaluations sharing a common codebase structure.

11. Summary

Key Takeaway: V-JEPA demonstrates that self-supervised video representation learning works best when the prediction target is abstract features rather than raw pixels. By masking spatial regions across the full temporal extent of a video clip and predicting their representations in a learned latent space — using no pixel reconstruction, no text supervision, no pretrained encoders, and no human annotations — V-JEPA produces video representations that capture both appearance and motion, transferring effectively to downstream tasks with a completely frozen backbone. Main Contributions: (1) First video-native instantiation of the JEPA framework, extending I-JEPA's latent prediction principle from images to video with 3D tubelet embedding, 3D positional encoding, and spatiotemporal multi-block masking. (2) Full-temporal-extent masking design that prevents trivial temporal interpolation, forcing genuine spatiotemporal understanding. (3) State-of-the-art frozen video representations: 81.9% K400, 72.2% SSv2, 77.9% IN1K (ViT-H/16-384) without any adaptation of encoder weights. (4) Over 20 percentage points better than image-pretrained models (DINOv2, OpenCLIP) on the motion-centric SSv2 benchmark. (5) 2× training efficiency over comparable pixel-reconstruction methods, processing only 270M samples in 90K iterations. (6) Empirical validation that the JEPA collapse-prevention mechanism (EMA + narrow predictor + stop-gradient) transfers successfully from images to video without requiring explicit regularization ($\text{reg\_coeff} = 0.0$).

12. References

  1. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. arXiv preprint arXiv:2404.08471.
  2. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
  3. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
  4. Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.
  5. Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., & Qiao, Y. (2023). VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. CVPR 2023.
  6. Feichtenhofer, C., Li, Y., He, K., et al. (2022). Masked Autoencoders As Spatiotemporal Learners. NeurIPS 2022.
  7. Ryali, C., Hu, Y.-T., Bolya, D., Wei, C., Fan, H., Huang, Y., Mangalam, K., Gupta, A., Li, W.Y., Girshick, R.B., Feichtenhofer, C. (2023). Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles. ICML 2023.
  8. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent — A New Approach to Self-Supervised Learning. NeurIPS 2020.
  9. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., ... & Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR.
  10. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
  11. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
  12. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.