At a glance
Problem: Global, last-layer-only latent video prediction underserves dense (spatially localized) tasks, leaves intermediate layers under-shaped, and single-modality tokenization limits applicability.
Key idea: Keep the V-JEPA 2 latent-prediction core but add a dense predictive loss, deep self-supervision at multiple depths, and multimodal tokenizers to produce strong dense video and image features.
Modality: Video (and image); ViT over tubelets, multimodal input interface.
Target / masking: EMA-target latent prediction over masked spatiotemporal tokens, refined with spatially fine-grained (dense) targets.
Builds on: V-JEPA 2 and the V-JEPA recipe; EMA self-distillation.
Used for: Reusable backbones for dense prediction; released checkpoints exposing dense video and image features.

Motivation

As latent video prediction is scaled toward broadly useful, transferable representations, several limitations surface. A purely global feature-prediction objective rewards coarse agreement and can underserve dense, spatially localized tasks such as segmentation or per-location prediction. Applying supervision only at the network's final layer leaves intermediate representations under-shaped, even though downstream heads often read from them. And single-modality tokenization restricts where the model can be applied. V-JEPA 2.1 sets out to refine V-JEPA 2 into a stronger producer of both dense video and image features, addressing these gaps without abandoning the JEPA core.

How it works

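[Figure: video tubelets with multi-block spatiotemporal masking feed a context encoder $f_\theta$ and a target encoder $\bar f_\theta$ (EMA copy); the predictor $g_\phi$ maps $z_{\text{ctx}}$ to $\hat z$; latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.]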
Canonical JEPA schematic for Video. The input is split into a visible context and hidden targets (tubelet-level, multi-block spatiotemporal). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance.
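
The EMA copy in the schematic can be illustrated with a minimal NumPy sketch. The function name `ema_update`, the dictionary-of-arrays parameter layout, and the momentum values are assumptions for illustration, not details taken from the V-JEPA 2.1 release. The target parameters are only ever overwritten by this average and never receive gradients, which is the role stop-gradient plays in an autodiff framework.

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.99):
    """Return new target parameters: momentum * target + (1 - momentum) * online."""
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * online_params[name]
        for name in target_params
    }

# Toy parameters standing in for encoder weights (hypothetical shapes).
online = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

# momentum=0.5 is chosen only to keep the arithmetic easy to follow;
# EMA targets are usually run with a momentum much closer to 1.
target = ema_update(target, online, momentum=0.5)
print(target["w"][0, 0])  # 0.5
```

With a momentum close to 1 the target encoder changes slowly, so the prediction targets stay stable while the context encoder trains.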

The JEPA core is retained: a context encoder over the visible spatiotemporal tokens, an EMA target encoder supplying stop-gradient prediction targets, and a predictor over the masked tokens trained with a latent prediction loss. Three refinements are layered on:

  • A dense predictive loss sharpens spatially fine-grained feature prediction rather than only coarse, global agreement.
  • Deep self-supervision applies the predictive objective at multiple network depths, so intermediate layers are directly shaped and become more transferable.
  • Multimodal tokenizers broaden the input interface across modalities, widening where the model applies.
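
The difference between coarse, global agreement and dense matching can be seen in a toy NumPy example. Both loss functions below are illustrative assumptions (squared distances on mean-pooled versus per-token features), not the exact losses used in V-JEPA 2.1. A prediction that swaps two tokens matches the target perfectly after pooling, yet is wrong at the token level.

```python
import numpy as np

def global_loss(pred, target):
    """Squared distance between mean-pooled features (coarse agreement)."""
    return float(np.sum((pred.mean(axis=0) - target.mean(axis=0)) ** 2))

def dense_loss(pred, target):
    """Mean over tokens of the per-token squared distance."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

# Four tokens with 2-D features; the prediction swaps the first two tokens.
target = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
pred = target[[1, 0, 2, 3]]

print(global_loss(pred, target))  # 0.0
print(dense_loss(pred, target))   # 1.0
```

The pooled loss cannot tell which feature sits at which location, and per-location structure is exactly what tasks such as segmentation depend on.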

The work also releases pretrained checkpoints that expose dense video and image features for direct downstream use.

The objective

On top of the standard masked latent-prediction loss between predicted and EMA-target features, V-JEPA 2.1 adds the dense and depth-distributed terms:

$$\mathcal{L} = \sum_{\ell \in \mathcal{D}} \beta_\ell \sum_{k \in \mathcal{M}} \big\lVert\, g_\phi^{(\ell)}(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k^{(\ell)}\big]\,\big\rVert$$

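where $\mathcal{D}$ is the set of supervised depths (deep self-supervision), $\beta_\ell$ the weight on depth $\ell$, $\mathcal{M}$ the set of masked tokens, $m_k$ the mask token for target position $k$, $g_\phi^{(\ell)}$ the predictor output matched against depth $\ell$, and $\operatorname{sg}$ the stop-gradient operator. The dense term enforces matching at fine spatial granularity rather than only pooled, global agreement. The EMA target and stop-gradient continue to provide the asymmetry that prevents collapse.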
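
As a rough numerical reading of the formula, the sketch below evaluates the double sum with NumPy. The function name `deep_dense_loss`, the dictionary layout keyed by depth, the toy depths 4 and 8, and the choice of an unsquared Euclidean norm are all assumptions; the formula leaves the norm unspecified. Targets are plain arrays here, so they are constants by construction, which stands in for stop-gradient.

```python
import numpy as np

def deep_dense_loss(preds, targets, masked_idx, betas):
    """Sum over depths of beta * sum over masked tokens of ||pred - target||.

    preds[depth]:   (num_masked, dim) predictor outputs for the masked tokens
    targets[depth]: (num_tokens, dim) EMA-target features, treated as constants
    masked_idx:     indices of the masked tokens within the target array
    betas:          per-depth loss weights, keyed by depth
    """
    total = 0.0
    for depth, beta in betas.items():
        diff = preds[depth] - targets[depth][masked_idx]
        total += beta * np.linalg.norm(diff, axis=-1).sum()
    return float(total)

# Toy setup: 6 tokens, 2-D features, two supervised depths (hypothetical).
masked_idx = np.array([1, 4])
targets = {4: np.zeros((6, 2)), 8: np.zeros((6, 2))}
preds = {
    4: np.array([[3.0, 4.0], [0.0, 0.0]]),
    8: np.array([[0.0, 0.0], [6.0, 8.0]]),
}
betas = {4: 0.5, 8: 1.0}

loss = deep_dense_loss(preds, targets, masked_idx, betas)
print(loss)  # 12.5 = 0.5 * (5 + 0) + 1.0 * (0 + 10)
```

Setting every weight except the last depth's to zero recovers a last-layer-only objective, which shows how the depth weights trade intermediate-layer shaping against the final-layer term.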

Key results & what's novel

The contribution is an objective-and-engineering advance that converts the V-JEPA family from primarily clip-level representation learners into reusable backbones for dense prediction, packaged with released weights. The dense predictive loss and deep self-supervision improve the quality and transferability of internal representations, while multimodal tokenizers broaden applicability. By distributing supervision across depths and sharpening spatial granularity, the model better supports tasks that need per-location structure, and the public checkpoints exposing dense video and image features lower the barrier to adoption.

Strengths & limitations

  • + Dense loss makes features useful for spatially grounded tasks, not just clip-level classification.
  • + Deep self-supervision shapes intermediate layers, improving transfer.
  • + Multimodal tokenizers and released checkpoints make it directly reusable.
  • - More loss terms and per-depth weights mean more hyperparameters to balance.
  • - Deep, dense supervision adds training cost and memory over plain V-JEPA 2.
  • - Still a representation learner: like V-JEPA, it has no action conditioning on its own, so planning relies on the action-conditioned siblings.

Connections & references