Video JEPA (V-JEPA)
1. Introduction
Self-supervised learning from video carries a fundamental promise: video provides a dense, naturally occurring signal about how the visual world changes over time, and any system that can predict future visual states from past observations must, implicitly, understand object permanence, motion dynamics, and scene semantics. Yet the dominant approaches to self-supervised video representation learning have struggled to deliver on this promise. Pixel-reconstruction methods such as VideoMAE (Tong et al., 2022) and MAE-ST (Feichtenhofer et al., 2022) learn by reconstructing masked spatiotemporal patches in raw pixel space, forcing the encoder to waste capacity on unpredictable low-level details — the exact texture of a shirt, the noise in a shadow, the precise value of every RGB pixel. Contrastive methods such as CVRL and VideoMoCo learn temporal invariances through augmentation-driven objectives, but require careful design of temporal and spatial augmentation pipelines that inevitably inject inductive biases about what should and should not be invariant. Language-supervised methods like VideoCLIP and InternVideo achieve strong zero-shot performance but are fundamentally dependent on paired text data, limiting them to domains where text supervision exists and biasing representations toward concepts expressible in language.
V-JEPA (Video Joint-Embedding Predictive Architecture), introduced by Bardes, Garrido, Ponce, Chen, Rabbat, LeCun, Assran, and Ballas in February 2024, addresses all three limitations by extending the JEPA paradigm from static images to video. V-JEPA learns by predicting the abstract representations of masked spatiotemporal regions in a learned latent space, using no pixel reconstruction, no pretrained image encoders, no text supervision, no negative examples, and no human annotations of any kind. This single principle — predict features, not pixels — yields representations that capture both appearance and motion, work across video and image tasks, and do so with substantially less compute than pixel-reconstruction alternatives.
V-JEPA directly extends I-JEPA (Assran et al., 2023), the image-based instantiation of the JEPA framework. However, the transition from images to video introduces several non-trivial challenges that V-JEPA addresses through three key design decisions:
- Spatiotemporal multi-block masking: I-JEPA masks 2D spatial blocks on a flat patch grid. V-JEPA operates on a 3D spatiotemporal token grid and employs a masking strategy where target blocks span the full temporal extent of the clip — every frame — but cover only a fraction of the spatial area. This design forces the predictor to reason about spatial relationships that persist across time, rather than trivially interpolating from nearby frames.
- Video Vision Transformer with tubelet embedding: V-JEPA uses a 3D patch embedding (tubelet size 2) that fuses pairs of consecutive frames at the input, producing a compact spatiotemporal token sequence. Combined with 3D sinusoidal positional embeddings and standard transformer blocks, this yields an architecture that processes video natively rather than treating it as a bag of independent frames.
- Frozen evaluation protocol: V-JEPA is evaluated primarily with the encoder weights completely frozen, using an attentive probing mechanism (a learnable cross-attention query attending to frozen features). This protocol is a stricter test of representation quality than fine-tuning, since the encoder cannot compensate for poor features by adapting to the downstream task.
The results are compelling. V-JEPA's largest model — a ViT-H/16 trained on 2 million publicly sourced videos — achieves 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K, all with a frozen backbone. On motion-heavy benchmarks like Something-Something-v2, V-JEPA outperforms image-pretrained models (DINOv2: 50.6%, OpenCLIP: 34.8%) by over 20 points, demonstrating that video pretraining captures temporal dynamics that image-only methods fundamentally miss. In label-efficient regimes (5% labeled data), V-JEPA ViT-H achieves 67.0% on K400 versus VideoMAE ViT-H's 62.3% — a gap that highlights the superior semantic quality of feature prediction over pixel reconstruction. Furthermore, V-JEPA processes only 270M total video samples across 90K training iterations, compared to VideoMAE's 410M samples in 400K iterations and Hiera's 770M in 1.5M iterations, representing a roughly 2× improvement in sample and compute efficiency.
This article provides a complete treatment of V-JEPA. Section 2 explains the method with accessible intuitions. Section 3 gives a model overview with an at-a-glance table and architectural diagram. Section 4 dissects each component — encoder, target encoder, predictor, spatiotemporal masking, and loss function — with full mathematical formulations, hyperparameter details, and ablation evidence. Section 5 provides exhaustive implementation details. Section 6 presents formal algorithms for training and inference. Section 7 walks through one training iteration with annotated diagrams. Section 8 describes the inference and downstream evaluation pipeline. Section 9 reports benchmark results and ablation studies. Section 10 situates V-JEPA in the JEPA family lineage, and Section 11 summarizes key takeaways.
2. Method
The core idea of V-JEPA can be stated simply: given a short video clip, mask large spatiotemporal regions of the patch token grid, encode only the visible tokens, and predict the representations of the masked tokens — as produced by a slowly-updating target encoder — without ever reconstructing pixels.
More concretely, V-JEPA's training procedure for a single video clip involves five steps:
- Tubelet embedding: A video clip of $F$ frames at spatial resolution $H \times W$ is divided into non-overlapping 3D patches (tubelets) of size $t_s \times p \times p$ (temporal × height × width), where $t_s = 2$ and $p = 16$. Each tubelet is linearly projected to an embedding vector of dimension $D$. For a clip of 16 frames at 224×224, this yields $T \times H_g \times W_g = 8 \times 14 \times 14 = 1568$ tokens.
- Spatiotemporal masking: Two types of target masks are generated on the 3D token grid. "Short-range" masks consist of 8 small blocks, each covering 15% of the spatial area and spanning the full temporal dimension. "Long-range" masks consist of 2 large blocks, each covering 70% of the spatial area and again spanning full temporal extent. The union of all target blocks leaves approximately 10% of tokens as visible context.
- Context encoding: Only the visible context tokens (the ~10% not covered by any mask) are fed through the online context encoder — a Video Vision Transformer — producing patch-level representations.
- Latent prediction: A smaller predictor transformer receives the context representations and learnable mask tokens positioned at target locations (with appropriate 3D positional embeddings). It predicts the representations of all masked tokens.
- Loss computation: Simultaneously, the entire unmasked clip (all 1568 tokens) is passed through the target encoder — an exponential moving average (EMA) copy of the context encoder with stop-gradient — producing ground-truth representations for the target positions. The L1 distance between predicted and actual target representations is minimized.
Gradients flow only through the context encoder and predictor. The target encoder receives no gradient signal and is updated solely via EMA. This asymmetric design, combined with the information bottleneck of the narrow predictor, prevents representational collapse without requiring contrastive negatives, architectural tricks, or explicit regularization.
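The shape arithmetic in these steps is easy to verify. Below is a minimal sketch of the tubelet embedding using a plain `Conv3d` with kernel equal to stride, as the method describes (ViT-H dimensions; an illustration, not the repository code):

```python
import torch
import torch.nn as nn

# Tubelet embedding sketch: Conv3d with kernel == stride == (2, 16, 16),
# i.e. tubelet size t_s = 2 and patch size p = 16. D is the ViT-H width.
D = 1280
patch_embed = nn.Conv3d(3, D, kernel_size=(2, 16, 16), stride=(2, 16, 16))

clip = torch.randn(1, 3, 16, 224, 224)      # (B, C, frames, H, W)
tokens = patch_embed(clip)                  # (1, D, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 1568, D)
assert tokens.shape == (1, 8 * 14 * 14, D)  # 1568 spatiotemporal tokens
```

A 16-frame clip thus collapses to 8 temporal slices of 14×14 spatial tokens, the 3D grid that all subsequent masking operates on.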
3. Model Overview
Architecture at a Glance
| Property | Value |
|---|---|
| Input type | Video frames → 3D tubelet patches (16 frames, patch 16×16, tubelet 2) |
| Masking strategy | Spatiotemporal multi-block: 8 short-range (15% spatial, full temporal) + 2 long-range (70% spatial, full temporal); ~90% tokens masked |
| Encoder architecture | Video Vision Transformer (ViT-L/16 or ViT-H/16) |
| Predictor type | Narrow Vision Transformer (12 layers, 384-dim, 12 heads) |
| Loss function | L1 in representation space: $\frac{1}{|\mathcal{T}|}\sum_{i \in \mathcal{T}} \| \hat{s}_i - \text{sg}(s_i) \|_1$ |
| Key result (frozen eval) | 81.9% K400 · 72.2% SSv2 · 77.9% IN1K (ViT-H/16-384) |
| Parameters | ViT-L: ~307M encoder + ~38M predictor; ViT-H: ~632M encoder + ~47M predictor |
Training Architecture Diagram
4. Main Components of V-JEPA
4.1 Encoder (Video Vision Transformer)
WHAT: The encoder is a standard Vision Transformer adapted for video through 3D patch embedding and 3D positional embeddings. It is the component that produces learned representations and whose weights are ultimately used for downstream tasks.
HOW: The encoder uses PatchEmbed3D — implemented as a single nn.Conv3d with kernel size $(t_s, p, p) = (2, 16, 16)$ and stride equal to kernel size — to project each spatiotemporal tubelet into a $D$-dimensional embedding. For ViT-L, $D = 1024$ with 24 transformer blocks and 16 attention heads. For ViT-H, $D = 1280$ with 32 blocks and 16 heads. Both use an MLP ratio of 4.0 and QKV bias. The architecture uses no [CLS] token — all computation operates on patch tokens only.
Positional embeddings are 3D sinusoidal (non-learnable), computed via get_3d_sincos_pos_embed with uniform_power=True, which allocates $\lceil D/6 \rceil \times 2$ dimensions per spatial/temporal axis and concatenates them (truncated to $D$). This allows interpolation to different spatial resolutions (e.g., 384) via trilinear interpolation.
Weight initialization follows a truncated normal distribution with $\sigma = 0.02$, with a critical residual rescaling: the attention output projection and MLP second-layer weights in block $l$ are rescaled by $1/\sqrt{2l}$, improving training stability for deep networks.
During training, the encoder receives only the unmasked context tokens (not the full token sequence). This is handled by MultiMaskWrapper, which calls apply_masks(x, masks) to gather only the visible token indices via torch.gather before feeding them through the transformer blocks. This yields significant compute savings: processing ~10% of tokens rather than 100%.
WHY: The no-[CLS]-token design is a deliberate choice inherited from I-JEPA. Since V-JEPA's objective is patch-level prediction (predicting representations of specific spatial locations), a global [CLS] token would add no benefit — all information must be spatially resolved. The aggressive residual rescaling ($1/\sqrt{2l}$) is necessary for stable training at depth 32 (ViT-H) with the large effective batch sizes used (3072).
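The context-gathering step can be sketched as follows. This is a minimal stand-in for `apply_masks`, assuming each mask tensor holds the indices of tokens to keep (shapes mirror the description above; not the repository code verbatim):

```python
import torch

def apply_masks(x, masks):
    """Gather only the visible token indices from x.

    x:     (B, N, D) full token sequence
    masks: list of (B, K) index tensors; the kept tokens for each mask
           are concatenated along the batch dimension, mirroring the
           multi-mask behavior described above.
    """
    out = []
    for m in masks:
        idx = m.unsqueeze(-1).expand(-1, -1, x.size(-1))  # (B, K, D)
        out.append(torch.gather(x, dim=1, index=idx))
    return torch.cat(out, dim=0)

x = torch.randn(2, 1568, 1280)
context_idx = torch.randint(0, 1568, (2, 157))  # ~10% visible tokens
ctx = apply_masks(x, [context_idx])
assert ctx.shape == (2, 157, 1280)
```

Because the transformer then runs on ~157 tokens rather than 1568, the compute savings come for free once gathering is done before the first block.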
4.2 Target Encoder (EMA)
WHAT: The target encoder is an identical copy of the context encoder whose parameters are updated via exponential moving average (EMA) of the online encoder's parameters, with a complete stop-gradient. It produces the regression targets for the predictor.
HOW: The target encoder is initialized as copy.deepcopy(encoder) with all parameters set to requires_grad=False. After each optimizer step, it is updated as:

$$\bar{\theta} \leftarrow m_t \cdot \bar{\theta} + (1 - m_t) \cdot \theta$$

where the momentum $m_t$ follows a linear schedule from $m_0 = 0.998$ to $m_T = 1.0$:

$$m_t = m_0 + \frac{t}{T}(m_T - m_0)$$

with $T = \text{ipe} \times \text{epochs} \times \text{ipe\_scale} = 300 \times 300 \times 1.25 = 112{,}500$ total steps. The target encoder processes the complete token sequence (all 1568 tokens, no masking), and its output is layer-normalized before being used as regression targets: $s_i = \text{LayerNorm}(\bar{f}_{\bar{\theta}}(x)_i)$.
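The update and its schedule fit in a few lines. A hedged sketch (function and variable names are mine, not the repository's):

```python
import copy
import torch
import torch.nn as nn

def momentum_schedule(step, total_steps, m0=0.998, mT=1.0):
    """Linear momentum schedule: m_t = m0 + (t / T) * (mT - m0)."""
    return m0 + step / total_steps * (mT - m0)

@torch.no_grad()
def ema_update(target, online, m):
    """theta_bar <- m * theta_bar + (1 - m) * theta, no gradients."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(m).add_(p_o, alpha=1.0 - m)

encoder = nn.Linear(8, 8)                 # stand-in for the online ViT
target = copy.deepcopy(encoder)
for p in target.parameters():
    p.requires_grad = False               # complete stop-gradient

T = 112_500
assert momentum_schedule(0, T) == 0.998
assert abs(momentum_schedule(T, T) - 1.0) < 1e-9
ema_update(target, encoder, momentum_schedule(0, T))
```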
WHY: The EMA target provides a slowly-evolving, stable regression target that prevents the collapse mode where the encoder could trivially satisfy the prediction objective by producing constant representations. As $m_t$ increases toward 1.0, the target encoder becomes increasingly frozen, which stabilizes late-stage training. Layer normalization of target features is applied in the loss computation (not inside the target encoder architecture) and ensures that the L1 loss operates on scale-normalized features, preventing the trivial collapse solution of shrinking all representations toward zero. The choice of L1 over L2 loss (as used in I-JEPA) was found to provide better training stability for video.
4.3 Predictor
WHAT: The predictor is a narrow transformer that maps from context encoder outputs to predictions of target encoder outputs at masked positions. Its narrowness is a critical bottleneck that forces the encoder to learn rich, spatially-resolved features.
HOW: The predictor (VisionTransformerPredictor) has 12 transformer blocks at dimension $D_p = 384$ with 12 attention heads — substantially narrower than the encoder (1024 or 1280). It operates as follows:
- Context representations are linearly projected from $D$ to $D_p$ via `predictor_embed`
- Learnable mask tokens ($D_p$-dimensional, zero-initialized) are placed at target positions. V-JEPA uses 2 distinct mask tokens — one per mask generator — selected via `mask_index` in `PredictorMultiMaskWrapper`
- 3D sinusoidal positional embeddings (same configuration as the encoder) are added to both context tokens and mask tokens
- The concatenated sequence $[\hat{h}_{\text{ctx}}; m_1, \ldots, m_M]$ passes through all 12 transformer blocks
- Only the positions corresponding to target tokens are extracted from the output
- A final linear projection maps $D_p \to D$ via `predictor_proj`
WHY: The narrow bottleneck ($D_p = 384$ vs. $D = 1280$) is essential for preventing a degenerate solution. If the predictor were as wide as the encoder, it could in principle memorize a lookup table mapping positional embeddings to representations, making the encoder's job trivial. The narrow predictor forces the encoder to produce context representations that are information-rich enough to support prediction despite the bottleneck. The use of 2 distinct mask tokens corresponds to the 2 mask generators (short-range and long-range), allowing the predictor to distinguish which masking pattern produced each target set.
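The steps above can be condensed into a compact sketch, using stock `nn.TransformerEncoder` blocks in place of the repository's custom blocks (class name, reduced depth, and elided positional embeddings are all simplifications of mine):

```python
import torch
import torch.nn as nn

D, Dp = 1280, 384  # encoder width vs. narrow predictor width

class PredictorSketch(nn.Module):
    """Illustrative predictor: project D -> Dp, append mask tokens,
    run transformer blocks, keep target positions, project Dp -> D."""
    def __init__(self, depth=2, heads=12):  # real depth is 12
        super().__init__()
        self.predictor_embed = nn.Linear(D, Dp)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, Dp))
        layer = nn.TransformerEncoderLayer(
            d_model=Dp, nhead=heads, dim_feedforward=4 * Dp, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.predictor_proj = nn.Linear(Dp, D)

    def forward(self, ctx, num_targets):
        # ctx: (B, N_ctx, D) context-encoder outputs
        h = self.predictor_embed(ctx)                      # (B, N_ctx, Dp)
        m = self.mask_token.expand(ctx.size(0), num_targets, -1)
        # (3D positional embeddings for both token groups elided here)
        out = self.blocks(torch.cat([h, m], dim=1))
        return self.predictor_proj(out[:, -num_targets:])  # (B, M, D)

pred = PredictorSketch()
y = pred(torch.randn(2, 157, D), num_targets=32)
assert y.shape == (2, 32, D)
```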
4.4 Spatiotemporal Masking Strategy
WHAT: V-JEPA employs a multi-block 3D masking strategy that generates two types of target masks on the spatiotemporal token grid. The masks determine which tokens the predictor must predict and which tokens the encoder sees as context.
HOW: The masking is implemented in MaskCollator (from src/masks/multiblock3d.py), which operates on a 3D grid of shape $(T_d, H_g, W_g) = (8, 14, 14)$ for a 16-frame, 224×224 clip. Two mask generators are configured:
Short-range masks (Generator 1): 8 blocks, each covering exactly 15% of the spatial area ($14 \times 14 = 196$ spatial positions × 0.15 ≈ 29 positions per block per frame), spanning the full temporal extent ($T_d = 8$). Aspect ratio sampled uniformly from $[0.75, 1.5]$.
Long-range masks (Generator 2): 2 blocks, each covering 70% of the spatial area ($196 \times 0.70 ≈ 137$ positions per block per frame), spanning full temporal extent. Aspect ratio also from $[0.75, 1.5]$.
For each block, the generation algorithm:
- Samples temporal scale $s_t \sim \text{Uniform}(1.0, 1.0)$ → always the full temporal dimension
- Samples spatial scale $s_s \sim \text{Uniform}(s_{\min}, s_{\max})$ (e.g., [0.15, 0.15] or [0.7, 0.7])
- Samples aspect ratio $a \sim \text{Uniform}(0.75, 1.5)$
- Computes block dimensions: $h = \lfloor\sqrt{s_s \cdot H_g \cdot W_g \cdot a}\rfloor$, $w = \lfloor\sqrt{s_s \cdot H_g \cdot W_g / a}\rfloor$, $t = \lceil s_t \cdot T_d \rceil$
- Samples a random center position and clips to grid boundaries
The target mask ($\mathcal{T}$) consists of all token indices within any sampled block. The context mask ($\mathcal{C}$) is the complement: $\mathcal{C} = \{1, \ldots, N\} \setminus \mathcal{T}$. The resulting masking ratio is approximately 90%, leaving roughly 157 visible tokens out of 1568.
Critically, the block sampling uses a shared seed across the batch (via multiprocessing.Value for thread-safe step counting), ensuring deterministic masking that is consistent across DataLoader workers but varies across training steps.
WHY: The full-temporal-extent design is the key architectural insight of V-JEPA. If masks spanned only a few frames, the predictor could reconstruct the missing region by copying from adjacent visible frames at the same spatial position — a strategy that requires no semantic understanding. By masking the same spatial region across all temporal positions, V-JEPA eliminates this shortcut and forces the model to predict what object or scene element occupies a given spatial location. The ablations confirm this decisively: random tube masking at 90% achieves only 51.5% on K400, while multi-block masking achieves 72.9% — a gap of 21.4 points.
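The block-sampling steps can be sketched as follows — a simplified illustration of the sampling logic described above, not the `MaskCollator` implementation (boundary clipping is reduced to the essentials, and the shared-seed machinery is omitted):

```python
import math
import random

T_d, H_g, W_g = 8, 14, 14  # spatiotemporal token grid for 16 x 224 x 224

def sample_block(spatial_scale, min_ar=0.75, max_ar=1.5):
    """Sample one 3D target block: full temporal extent, random
    spatial size and aspect ratio, random position on the grid."""
    a = random.uniform(min_ar, max_ar)
    h = int(math.sqrt(spatial_scale * H_g * W_g * a))
    w = int(math.sqrt(spatial_scale * H_g * W_g / a))
    h, w = min(h, H_g), min(w, W_g)     # clip to grid boundaries
    t = T_d                              # s_t ~ Uniform(1, 1): full clip
    top = random.randint(0, H_g - h)
    left = random.randint(0, W_g - w)
    return t, (top, left, h, w)

random.seed(0)
t, (top, left, h, w) = sample_block(0.70)  # one long-range block
assert t == 8 and top + h <= H_g and left + w <= W_g
```

A short-range mask would call `sample_block(0.15)` eight times and take the union of the resulting index sets; the context is whatever the union leaves uncovered.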
4.5 Loss Function
WHAT: V-JEPA minimizes the L1 distance between predicted representations and layer-normalized target representations in the latent space of the target encoder.
HOW: Given a batch of $B$ video clips, let $\mathcal{T}_k$ denote the set of target token indices for the $k$-th mask generator ($k \in \{1, 2\}$), with $|\mathcal{T}_k| = M_k$. For each mask generator $k$:
- $\hat{s}^{(k)}_i \in \mathbb{R}^D$: the predictor's output at target position $i \in \mathcal{T}_k$, projected from $D_p$ back to $D$
- $s^{(k)}_i \in \mathbb{R}^D$: the target encoder's output at position $i$, after layer normalization: $s^{(k)}_i = \text{LayerNorm}(\bar{f}_{\bar{\theta}}(x)_i)$
The JEPA loss is:
$$\mathcal{L}_{\text{JEPA}} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{M_k} \sum_{i \in \mathcal{T}_k} \left\| \hat{s}^{(k)}_i - \text{sg}(s^{(k)}_i) \right\|_1$$where $K = 2$ is the number of mask generators, $\text{sg}(\cdot)$ denotes stop-gradient, and $\| \cdot \|_1$ denotes the element-wise L1 norm (mean of absolute values across the $D$ dimensions). In code, this is implemented as:
```python
def loss_fn(z, h):
    """z: list of predicted reps, h: list of target reps (layer-normed)."""
    loss = 0.
    for zi, hi in zip(z, h):
        loss += torch.mean(torch.abs(zi - hi))  # loss_exp=1.0 → L1
    loss /= len(z)  # average over K mask generators
    return loss
```
The codebase also implements a variance regularization term:
$$\mathcal{L}_{\text{reg}} = \frac{1}{K}\sum_{k=1}^{K} \text{mean}\left(\text{ReLU}\left(1 - \sqrt{\text{Var}_{i \in \mathcal{T}_k}(\hat{s}^{(k)}_i) + \epsilon}\right)\right)$$where $\text{Var}_{i \in \mathcal{T}_k}$ computes the variance across the spatial/temporal (patch) dimension and $\epsilon = 10^{-4}$. This encourages the per-dimension standard deviation of predictions to be at least 1.0, preventing variance collapse. However, in all released configurations, reg_coeff = 0.0, meaning this regularization is disabled in practice — the EMA + stop-gradient + narrow predictor combination is sufficient to prevent collapse without explicit regularization.
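A sketch of this regularizer, assuming predictions of shape $(B, M, D)$ with the variance taken over the patch dimension as described (function name is mine):

```python
import torch
import torch.nn.functional as F

def variance_reg(z_list, eps=1e-4):
    """Penalize per-dimension std below 1.0 across predicted tokens.
    z_list: one (B, M, D) prediction tensor per mask generator."""
    reg = 0.0
    for z in z_list:
        std = torch.sqrt(z.var(dim=1) + eps)   # std over the patch dim
        reg += torch.mean(F.relu(1.0 - std))
    return reg / len(z_list)

collapsed = torch.zeros(2, 64, 16)       # zero variance -> penalized
spread = torch.randn(2, 64, 16) * 5.0    # high variance -> no penalty
assert float(variance_reg([collapsed])) > 0.9
assert float(variance_reg([spread])) == 0.0
```

With `reg_coeff = 0.0` this term never contributes to the released training runs; it exists in the codebase as a safety valve against variance collapse.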
The total loss is therefore simply:
$$\mathcal{L} = \mathcal{L}_{\text{JEPA}}$$

WHY: V-JEPA uses L1 loss rather than I-JEPA's L2 loss. L1 is more robust to outliers and distributional shifts in the target representations, which are more likely in video (where consecutive frames can have rapid appearance changes) than in static images. The layer normalization applied to targets normalizes the scale of each target vector to have zero mean and unit variance across the $D$ dimension, ensuring that the L1 loss treats all dimensions equally and preventing the model from trivially reducing loss by shrinking representation magnitudes. The fact that variance regularization is unnecessary ($\text{reg\_coeff} = 0.0$) is notable — it empirically demonstrates that V-JEPA's architectural design (EMA target, narrow predictor, aggressive masking) sufficiently prevents collapse without any explicit regularization loss term.
5. Implementation Details
Pretraining Hyperparameters
| Parameter | ViT-L/16 (224) | ViT-H/16 (224) | ViT-H/16 (384) |
|---|---|---|---|
| Encoder layers | 24 | 32 | 32 |
| Encoder embed dim | 1024 | 1280 | 1280 |
| Encoder heads | 16 | 16 | 16 |
| Encoder MLP ratio | 4.0 | 4.0 | 4.0 |
| Patch size | 16 × 16 | 16 × 16 | 16 × 16 |
| Tubelet size | 2 | 2 | 2 |
| Input frames | 16 (frame_step=4) | 16 (frame_step=4) | 16 (frame_step=4) |
| Crop size | 224 | 224 | 384 |
| Num tokens (N) | 8×14×14 = 1568 | 8×14×14 = 1568 | 8×24×24 = 4608 |
| Predictor layers | 12 | 12 | 12 |
| Predictor embed dim | 384 | 384 | 384 |
| Predictor heads | 12 | 12 | 12 |
| Num mask tokens | 2 | 2 | 2 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999, ε=1e-8) | AdamW | AdamW |
| Peak LR | 6.25×10⁻⁴ | 6.25×10⁻⁴ | 6.25×10⁻⁴ |
| Start LR | 2.0×10⁻⁴ | 2.0×10⁻⁴ | 2.0×10⁻⁴ |
| Final LR | 1.0×10⁻⁶ | 1.0×10⁻⁶ | 1.0×10⁻⁶ |
| LR schedule | Linear warmup → cosine decay | same | same |
| Warmup epochs | 40 | 40 | 40 |
| Weight decay (init→final) | 0.04 → 0.4 (cosine) | 0.04 → 0.4 | 0.04 → 0.4 |
| Gradient clip | 10.0 | 10.0 | 10.0 |
| Batch size (per GPU) | 24 | 24 | 10 |
| GPUs | 128 (16 nodes × 8) | 128 (16 nodes × 8) | 240 (30 nodes × 8) |
| Effective batch size | 3072 | 3072 | 2400 |
| Epochs | 300 | 300 | 300 |
| Iterations per epoch | 300 | 300 | 300 |
| ipe_scale | 1.25 | 1.25 | 1.25 |
| Total iterations | ~90,000 | ~90,000 | ~90,000 |
| Total samples processed | ~270M | ~270M | ~216M |
| EMA schedule | 0.998 → 1.0 (linear) | 0.998 → 1.0 | 0.998 → 1.0 |
| Loss exponent | 1.0 (L1) | 1.0 (L1) | 1.0 (L1) |
| Reg coeff | 0.0 (disabled) | 0.0 | 0.0 |
| Mixed precision | bfloat16 | bfloat16 | bfloat16 |
| Positional embed | 3D sinusoidal (uniform_power) | same | same |
| SDPA | True | True | True |
Training Data: VideoMix2M
V-JEPA trains on VideoMix2M, a collection of approximately 2 million publicly available videos compiled from three sources:
| Source | ~Videos | Description |
|---|---|---|
| Kinetics-400/600/700 (merged as K710) | ~700K | Action recognition clips |
| Something-Something-v2 | ~220K | Object manipulation / motion |
| HowTo100M | ~1.1M | Instructional videos (no text used) |
Overlapping videos with downstream evaluation validation/test sets are removed. Data augmentation is minimal: random resized crop (scale $[0.3, 1.0]$, aspect ratio $[0.75, 1.35]$) and ImageNet normalization (mean=$(0.485, 0.456, 0.406)$, std=$(0.229, 0.224, 0.225)$). No auto-augmentation, no random erasing, no motion shift.
Key Class Names (from repository)
| Component | Class | File |
|---|---|---|
| Encoder | VisionTransformer | src/models/vision_transformer.py |
| 3D Patch Embed | PatchEmbed3D | src/models/utils/patch_embed.py |
| Predictor | VisionTransformerPredictor | src/models/predictor.py |
| Encoder Wrapper | MultiMaskWrapper | src/models/utils/multimask.py |
| Predictor Wrapper | PredictorMultiMaskWrapper | src/models/utils/multimask.py |
| 3D Mask Collator | MaskCollator | src/masks/multiblock3d.py |
| Attentive Probe | AttentiveClassifier | src/models/attentive_pooler.py |
| LR Schedule | WarmupCosineSchedule | src/utils/schedulers.py |
| WD Schedule | CosineWDSchedule | src/utils/schedulers.py |
| Training Loop | (procedural) | app/vjepa/train.py |
6. Algorithm
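The full training procedure described in Sections 4 and 5 condenses into the following runnable sketch. Toy linear modules stand in for the ViT encoder and the transformer predictor, and mask tokens and positional embeddings are elided; only the control flow — no-grad targets, context-only encoding, L1 loss, EMA update — follows the actual algorithm.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy dimensions so the sketch runs instantly; the real model uses a
# ViT encoder (D = 1280, N = 1568 tokens) and a transformer predictor.
B, N, D = 2, 64, 32
encoder = nn.Linear(D, D)                       # stand-in for the ViT
predictor = nn.Sequential(nn.Linear(D, 16), nn.GELU(), nn.Linear(16, D))
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad = False                     # complete stop-gradient

opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=6.25e-4)

def train_step(x, ctx_idx, tgt_idx, m=0.998):
    # 1. Targets: full sequence through the EMA encoder, layer-normed.
    with torch.no_grad():
        h_tgt = F.layer_norm(target_encoder(x), (D,))[:, tgt_idx]
    # 2. Context: only the visible tokens through the online encoder.
    z_ctx = encoder(x[:, ctx_idx])
    # 3. Predict targets. The real predictor attends over context plus
    #    learnable mask tokens; here pooled context is broadcast to
    #    each target position and passed through an MLP.
    pooled = z_ctx.mean(dim=1, keepdim=True).expand(-1, len(tgt_idx), -1)
    z_pred = predictor(pooled)
    # 4. L1 loss in representation space (targets carry no gradient).
    loss = torch.mean(torch.abs(z_pred - h_tgt))
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 5. EMA update of the target encoder.
    with torch.no_grad():
        for pt, po in zip(target_encoder.parameters(), encoder.parameters()):
            pt.mul_(m).add_(po, alpha=1 - m)
    return loss.item()

x = torch.randn(B, N, D)
loss = train_step(x, ctx_idx=torch.arange(6), tgt_idx=torch.arange(6, N))
assert loss > 0.0
```

Inference needs none of this machinery: the trained encoder is run on the full (unmasked) token sequence and its outputs are handed to a downstream probe, as Section 8 details.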
7. Training
Step-by-step: One Training Iteration
The following describes exactly what happens during a single training step of V-JEPA, with concrete dimensions for ViT-H/16 at 224 resolution:
Step 1 — Data loading. A mini-batch of $B = 3072$ videos is loaded (distributed across 128 GPUs, 24 per GPU). For each video, 16 frames are sampled at frame_step=4 (covering ~2.1 seconds at 30fps). Random resized crop and normalization yield tensors of shape $(B, 3, 16, 224, 224)$.
Step 2 — Tubelet embedding. PatchEmbed3D (a Conv3d with kernel $(2, 16, 16)$) embeds each clip into $N = 8 \times 14 \times 14 = 1568$ tokens of dimension $D = 1280$. 3D sinusoidal positional embeddings are added. Shape: $(B, 1568, 1280)$.
Step 3 — Mask generation. Two mask generators produce target indices. Generator 1 (short-range): 8 blocks at 15% spatial, full temporal → ~$8 \times 29 \times 8 = 1856$ token-slots, with overlap yielding ~1000–1200 unique target tokens. Generator 2 (long-range): 2 blocks at 70% spatial, full temporal → ~$2 \times 137 \times 8 = 2192$ token-slots. The union of all targets leaves approximately $|\mathcal{C}| \approx 157$ visible context tokens.
Step 4 — Target encoder forward (no gradient). The full token sequence $(B, 1568, 1280)$ is fed through the target encoder (32 transformer blocks). Output is layer-normalized. Target representations are gathered at positions $\mathcal{T}_1$ and $\mathcal{T}_2$ via torch.gather. This produces two target tensors: $h^{(1)}_{\text{tgt}} \in \mathbb{R}^{B \times M_1 \times 1280}$ and $h^{(2)}_{\text{tgt}} \in \mathbb{R}^{B \times M_2 \times 1280}$.
Step 5 — Context encoder forward (with gradient). Only context tokens are gathered: $X_{\text{ctx}} \in \mathbb{R}^{B \times |\mathcal{C}| \times 1280}$, where $|\mathcal{C}| \approx 157$. This passes through the online encoder (32 blocks), producing $z_{\text{ctx}} \in \mathbb{R}^{B \times 157 \times 1280}$. Processing 157 tokens instead of 1568 yields roughly a 10× reduction in encoder FLOPs.
Step 6 — Predictor forward (with gradient). For each mask generator $k \in \{1, 2\}$: (a) Context representations are projected $1280 \to 384$. (b) $M_k$ learnable mask tokens (zero-initialized, $D_p = 384$) with 3D positional embeddings are concatenated. (c) The sequence $(|\mathcal{C}| + M_k, 384)$ passes through 12 predictor blocks. (d) Only target-position outputs are extracted and projected $384 \to 1280$. Result: $\hat{s}^{(k)} \in \mathbb{R}^{B \times M_k \times 1280}$.
Step 7 — Loss computation. L1 loss between $\hat{s}^{(k)}$ and $\text{sg}(h^{(k)}_{\text{tgt}})$, averaged over tokens and mask generators.
Step 8 — Backward pass. Gradients are computed with respect to $\theta$ (encoder) and $\phi$ (predictor). Gradients are clipped at norm 10.0.
Step 9 — Optimizer step. AdamW updates both parameter groups. Learning rate follows cosine schedule from $6.25 \times 10^{-4}$ (peak, after 40 warmup epochs) to $10^{-6}$ (final). Weight decay follows cosine schedule from 0.04 to 0.4.
Step 10 — EMA update. Target encoder parameters are updated: $\bar{\theta} \leftarrow m_t \cdot \bar{\theta} + (1 - m_t) \cdot \theta$, where $m_t$ linearly increases from 0.998 to 1.0 over 112,500 total steps.
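Step 9's learning-rate schedule can be sketched directly from the pretraining table (function name is mine, not the exact `WarmupCosineSchedule` interface; note the decay horizon is stretched by `ipe_scale`, so the final value is only approached at the actual 90K-step end of training):

```python
import math

def warmup_cosine_lr(step, warmup_steps, total_steps,
                     start_lr=2.0e-4, peak_lr=6.25e-4, final_lr=1.0e-6):
    """Linear warmup from start_lr to peak_lr, then cosine decay to
    final_lr, using the values from the pretraining table above."""
    if step < warmup_steps:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

warmup = 40 * 300    # 40 warmup epochs x 300 iterations per epoch
total = 112_500      # ipe x epochs x ipe_scale, as in the EMA schedule
assert warmup_cosine_lr(0, warmup, total) == 2.0e-4
assert abs(warmup_cosine_lr(warmup, warmup, total) - 6.25e-4) < 1e-12
```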
Training Iteration Diagram
8. Inference
V-JEPA's primary evaluation protocol uses a completely frozen encoder with an attentive probing mechanism — a lightweight cross-attention pooler trained on top of frozen features. This protocol is a strict test of feature quality: the encoder cannot adapt to compensate for poor representations.
Attentive Probing (AttentiveClassifier)
The probe consists of an AttentivePooler followed by a linear classifier:
- A single learnable query token $q \in \mathbb{R}^{1 \times D}$ attends to frozen encoder features $H \in \mathbb{R}^{N \times D}$ via cross-attention (with $D = 1280$ for ViT-H)
- A single self-attention block further refines the query (depth=1, mlp_ratio=4.0)
- A linear head projects from $D$ to the number of classes
The attentive probe is trained for 20 epochs with AdamW (lr=0.001, wd=0.01) on the downstream dataset while the encoder remains completely frozen. This adds minimal parameters (~5M) compared to the encoder's hundreds of millions.
The improvement over simple average pooling is dramatic: +17.3 points on K400 and +16.1 points on SSv2. This indicates that V-JEPA's representations are rich in spatially-distributed information that average pooling destroys.
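A minimal sketch of the probe follows; the released `AttentiveClassifier` additionally includes a self-attention block and MLP, so the class below is an illustrative reduction (names are mine):

```python
import torch
import torch.nn as nn

class AttentiveProbeSketch(nn.Module):
    """One learnable query cross-attends to frozen encoder features,
    then a linear head classifies the pooled vector."""
    def __init__(self, D=1280, num_classes=400, heads=16):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, D))
        self.xattn = nn.MultiheadAttention(D, heads, batch_first=True)
        self.head = nn.Linear(D, num_classes)

    def forward(self, feats):                   # feats: (B, N, D), frozen
        q = self.query.expand(feats.size(0), -1, -1)
        pooled, _ = self.xattn(q, feats, feats)  # (B, 1, D)
        return self.head(pooled.squeeze(1))      # (B, num_classes)

probe = AttentiveProbeSketch(D=64, num_classes=10, heads=4)
logits = probe(torch.randn(2, 157, 64))
assert logits.shape == (2, 10)
```

Only the probe's parameters receive gradients during downstream training; the encoder outputs can even be precomputed and cached.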
Multi-view Inference
For video classification, V-JEPA uses multi-view testing with notation $F \times S \times C$ meaning $F$ frames per clip, $S$ temporal segments, $C$ spatial crops:
| Benchmark | Multi-view | Segments | Crops | attend_across_segments |
|---|---|---|---|---|
| Kinetics-400 | 16×8×3 | 8 | 3 (left/center/right) | True |
| Something-Something-v2 | 16×2×3 | 2 | 3 | True |
| ImageNet-1K | single view | 1 | 1 (center crop) | N/A |
With attend_across_segments=True, the attentive pooler's query can attend to features from all temporal segments simultaneously, enabling temporal reasoning across the full video duration.
Image Inference
For image tasks (ImageNet-1K, Places205, iNaturalist2021), a single image is treated as a 16-frame "video" by repeating the frame. The tubelet embedding (size 2) processes pairs of identical frames, and 3D positional embeddings are applied normally. The attentive probe attends to all $8 \times 14 \times 14 = 1568$ resulting tokens.
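The frame-repetition step is a one-liner (helper name is mine):

```python
import torch

def image_to_clip(img, num_frames=16):
    """Repeat a single image along a new temporal axis so the video
    encoder (tubelet size 2) can process it as a 16-frame clip."""
    # img: (B, 3, H, W) -> (B, 3, num_frames, H, W)
    return img.unsqueeze(2).expand(-1, -1, num_frames, -1, -1)

clip = image_to_clip(torch.randn(2, 3, 224, 224))
assert clip.shape == (2, 3, 16, 224, 224)
```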
Inference Pipeline Diagram
9. Results & Benchmarks
Primary Frozen Evaluation (Attentive Probing)
All results below use a completely frozen encoder backbone. Only the lightweight attentive probe (~5M parameters) is trained on the downstream task.
| Model | K400 (top-1) | SSv2 (top-1) | IN1K (top-1) | Places205 | iNat2021 | AVA (mAP) |
|---|---|---|---|---|---|---|
| V-JEPA ViT-L/16 (224) | 80.8% | 69.5% | 74.8% | 60.3% | 67.8% | 25.6 |
| V-JEPA ViT-H/16 (224) | 82.0% | 71.4% | 75.9% | 61.7% | 67.9% | 25.8 |
| V-JEPA ViT-H/16 (384) | 81.9% | 72.2% | 77.9% | 62.8% | 72.6% | 25.0 |
Feature Prediction vs. Pixel Reconstruction (ViT-L Ablation)
| Objective | K400 | SSv2 | IN1K |
|---|---|---|---|
| Feature prediction (V-JEPA) | 73.7% | 66.2% | 74.8% |
| Pixel reconstruction | 68.6% | 66.0% | 73.3% |
| Δ (feature − pixel) | +5.1 | +0.2 | +1.5 |
Feature prediction shows its largest advantage on motion/appearance-balanced tasks (K400: +5.1 points). The SSv2 gap is small because SSv2 is motion-dominant and less affected by low-level texture noise. ImageNet shows a consistent +1.5 point advantage, indicating that feature prediction benefits image tasks as well.
Masking Strategy Ablation (ViT-L)
| Strategy | K400 | SSv2 |
|---|---|---|
| Random-tube [0.9] | 51.5% | 46.4% |
| Causal multi-block [6 frames] | 61.3% | 49.8% |
| Causal multi-block [12 frames] | 71.9% | 63.6% |
| Multi-block (V-JEPA) | 72.9% | 67.4% |
The multi-block strategy outperforms all alternatives. Random tube masking is catastrophically poor (51.5%), confirming that structured spatial masking across all temporal positions is essential. Causal masking (restricting the context to the first 6 or 12 frames of the clip) performs worse than bidirectional multi-block masking, especially on SSv2, where both forward and backward temporal reasoning matter.
Comparison with Prior Self-supervised Methods (Frozen Evaluation)
| Method | Arch | Samples | Iterations | K400 | SSv2 |
|---|---|---|---|---|---|
| VideoMAE | ViT-L | 410M | 400K | 77.8% | 65.5% |
| Hiera | Hiera-L | 770M | 1500K | — | — |
| V-JEPA | ViT-L | 270M | 90K | 80.8% | 69.5% |
V-JEPA achieves higher accuracy with fewer samples (270M vs. 410M) and far fewer training iterations (90K vs. 400K–1500K), representing roughly a 2× improvement in compute efficiency over VideoMAE and an order of magnitude fewer iterations than Hiera.
Comparison with Image-pretrained Models (Frozen)
| Method | Pretraining | K400 | SSv2 |
|---|---|---|---|
| DINOv2 ViT-g/14 | Image (LVD-142M) | 83.4% | 50.6% |
| OpenCLIP ViT-G/14 | Image+Text (LAION-2B) | 81.8% | 34.8% |
| VideoMAEv2 ViT-g/14 | Video (UnlabeledHybrid) | 71.2% | 61.2% |
| V-JEPA ViT-H/16 (384) | Video (VideoMix2M) | 81.9% | 72.2% |
DINOv2 edges out V-JEPA on K400 (83.4% vs. 81.9%) — expected given DINOv2's massive image pretraining data (142M images) and giant architecture. However, on the motion-centric SSv2 benchmark, V-JEPA dominates: 72.2% versus DINOv2's 50.6% and OpenCLIP's 34.8%. This 20+ point gap on SSv2 demonstrates that video pretraining captures temporal dynamics that image-only or image+text methods fundamentally cannot learn.
Label Efficiency (5% Labeled Data, Fine-tuning)
| Method | Arch | K400 (5%) | SSv2 (5%) |
|---|---|---|---|
| VideoMAE | ViT-H/16 | 62.3% | 41.4% |
| VideoMAEv2 | ViT-g/14 | 37.0% | 28.0% |
| V-JEPA | ViT-H/16 | 67.0% | 54.0% |
With only 5% labeled data, V-JEPA outperforms VideoMAE by 4.7 points on K400 and 12.6 points on SSv2. The SSv2 gap is especially large, indicating that V-JEPA's feature-prediction pretraining produces representations that are substantially more data-efficient for downstream temporal reasoning tasks.
Full Fine-tuning Results (ViT-L/16)
| Method | K400 | SSv2 |
|---|---|---|
| VideoMAE | 85.4% | 74.3% |
| Hiera | 87.3% | 75.1% |
| V-JEPA | 85.6% | 75.1% |
Under full fine-tuning, V-JEPA matches or exceeds VideoMAE and is competitive with Hiera. The gap between frozen and fine-tuned performance is much smaller for V-JEPA (~5 points on K400) than for pixel-reconstruction methods, indicating that V-JEPA features are already close to task-optimal without adaptation.
Attentive Probing vs. Average Pooling
| Pooling Method | K400 | SSv2 |
|---|---|---|
| Average pooling + linear | ~63.5% | ~53.4% |
| Attentive probing | 80.8% | 69.5% |
| Δ | +17.3 | +16.1 |
The +17 point improvement from attentive probing over average pooling indicates that V-JEPA's representations are highly spatially structured — information is distributed across token positions rather than being globally aggregated. Average pooling destroys this spatial structure, while the cross-attention query can selectively attend to the most task-relevant tokens.
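The mechanism behind this gap can be made concrete: average pooling collapses all token positions with uniform weights, while an attentive probe uses a learned query to form a data-dependent weighted sum. A minimal single-head NumPy sketch (parameter names and the single-query, single-head setup are illustrative simplifications):

```python
import numpy as np

def average_pool(tokens):
    """Baseline: uniform mean over token positions (discards which
    token carried the information)."""
    return tokens.mean(axis=0)

def attentive_pool(tokens, query, w_k, w_v):
    """Sketch of attentive probing: a learned query cross-attends over
    the frozen encoder's tokens and returns a weighted sum of values."""
    d = query.shape[-1]
    keys = tokens @ w_k                   # (N, d)
    values = tokens @ w_v                 # (N, d)
    scores = keys @ query / np.sqrt(d)    # (N,) similarity per token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over token positions
    return weights @ values               # (d,) task-adaptive summary

rng = np.random.default_rng(0)
N, d = 1568, 64                # e.g. 8x14x14 tokens from a frozen encoder
tokens = rng.normal(size=(N, d))
query = rng.normal(size=d)               # learned during probing
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d)) / np.sqrt(d)

pooled_avg = average_pool(tokens)        # fixed, uniform aggregation
pooled_att = attentive_pool(tokens, query, w_k, w_v)
```

During frozen evaluation only the probe parameters (here `query`, `w_k`, `w_v`, plus a classifier head) would be trained; the encoder producing `tokens` stays fixed.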
10. Connection to JEPA Family
V-JEPA is the first video-native instantiation of the Joint-Embedding Predictive Architecture framework. Its lineage within the JEPA family is direct and well-defined:
- Derives from I-JEPA (Assran et al., 2023): V-JEPA inherits I-JEPA's core architecture (online encoder, EMA target encoder, narrow predictor transformer, multi-block masking, and loss in representation space) and extends each component to the spatiotemporal domain. The conceptual framework is identical; the engineering differs in patch embedding (3D tubelets vs. 2D patches), positional embeddings (3D sinusoidal vs. 2D), masking (full-temporal-extent 3D blocks vs. 2D spatial blocks), and loss function (L1 vs. L2).
- Conceptual lineage to JEPA (LeCun, 2022): V-JEPA embodies the key principles of the original JEPA position paper: prediction in learned representation space rather than input space, an energy-based formulation with a learned predictor, and a target encoder that provides stable regression targets. V-JEPA also inherits conceptual connections to BYOL-style EMA self-supervised learning (Grill et al., 2020) and the broader family of Siamese SSL methods, though it differs from these by using spatial prediction rather than augmentation-driven invariance.
Beyond this inheritance, V-JEPA makes several contributions of its own:
- Video-native feature prediction: V-JEPA is the first method to demonstrate that feature prediction (rather than pixel prediction) yields superior video representations, validating the JEPA principle beyond static images.
- Full-temporal-extent masking: The design of masking spatial regions across all frames — rather than random spatiotemporal cubes — is a critical insight that prevents trivial temporal interpolation shortcuts. This masking strategy is specific to V-JEPA and not present in I-JEPA's 2D masking.
- Frozen evaluation as the primary protocol: While I-JEPA reports both linear probing and fine-tuning, V-JEPA foregrounds frozen evaluation with attentive probing as the primary measure of representation quality. This raises the bar for what constitutes a good self-supervised video model.
- No language supervision: Unlike many video SSL methods that leverage text (InternVideo, VideoCLIP), V-JEPA demonstrates that purely visual self-supervised learning from unlabeled video — with no text, no labels, no pretrained encoders — can match or exceed language-supervised approaches on motion-centric benchmarks.
- Training efficiency: V-JEPA processes only ~270M video samples in 90K iterations (roughly 1.5× fewer samples and over 4× fewer iterations than VideoMAE, and an order of magnitude fewer iterations than Hiera), demonstrating that operating in feature space is not just qualitatively better but also computationally cheaper.
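Two of the concrete engineering differences from I-JEPA listed above, 3D tubelet tokenization and the L1 feature-space loss, can be sketched compactly (the real model uses a learned 3D convolution for tokenization; this reshape-based version only illustrates the token geometry):

```python
import numpy as np

def tubelet_tokens(video, t_p=2, p=16):
    """Split a video (T, H, W, C) into non-overlapping 3D tubelets of
    size t_p x p x p and flatten each into a token: the video analogue
    of I-JEPA's 2D patch embedding (a sketch of the token geometry)."""
    T, H, W, C = video.shape
    assert T % t_p == 0 and H % p == 0 and W % p == 0
    v = video.reshape(T // t_p, t_p, H // p, p, W // p, p, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)       # group tubelet dims last
    return v.reshape(-1, t_p * p * p * C)      # (num_tokens, token_dim)

def l1_feature_loss(pred, target):
    """V-JEPA regresses target-encoder features with an L1 loss,
    where I-JEPA used L2; both operate in representation space."""
    return np.abs(pred - target).mean()

video = np.zeros((16, 224, 224, 3), dtype=np.float32)
tokens = tubelet_tokens(video)
# 16/2 * 224/16 * 224/16 = 8 * 14 * 14 = 1568 tokens of dim 2*16*16*3
assert tokens.shape == (1568, 1536)
```

A 16-frame, 224×224 clip thus yields an 8×14×14 token grid, which is the grid the full-temporal-extent masks are sampled over.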
Influence on subsequent work: V-JEPA establishes the viability of the JEPA paradigm for video understanding and opens the path to further temporal extensions. The related MC-JEPA work applies joint-embedding prediction to learning motion (optical flow) and content features together. V-JEPA's frozen evaluation protocol and attentive probing mechanism have also been adopted by subsequent video representation learning work as a more rigorous evaluation standard. The V-JEPA codebase (hosted at github.com/facebookresearch/jepa) serves as the reference implementation for the JEPA family, with I-JEPA, V-JEPA, and related evaluations sharing a common codebase structure.
11. Summary
V-JEPA extends the joint-embedding predictive architecture from images to video: an online encoder and a narrow predictor learn to regress the representations of masked spatiotemporal regions produced by an EMA target encoder, with no pixel reconstruction, no text supervision, no negative examples, and no pretrained encoders. Full-temporal-extent multi-block masking closes the temporal-interpolation shortcut, and frozen evaluation with attentive probing shows the resulting features transfer strongly to both appearance-centric (K400) and motion-centric (SSv2) benchmarks, at a fraction of the training cost of pixel-reconstruction methods.
12. References
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. arXiv preprint arXiv:2404.08471.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
- Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.
- Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., & Qiao, Y. (2023). VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. CVPR 2023.
- Feichtenhofer, C., Fan, H., Li, Y., & He, K. (2022). Masked Autoencoders As Spatiotemporal Learners. NeurIPS 2022.
- Ryali, C., Hu, Y.-T., Bolya, D., Wei, C., Fan, H., ... & Feichtenhofer, C. (2023). Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles. ICML 2023.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent — A New Approach to Self-Supervised Learning. NeurIPS 2020.
- Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., ... & Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.