Authors: Assran, Bardes, Fan, Garrido, Howes, Muckley, Ballas, LeCun, Rabbat et al.
Date: 2025-06
Category: Physics / World Models
Derives from: V-JEPA

1. Introduction

Self-supervised video representation learning has made remarkable strides in producing encoders that capture rich spatiotemporal semantics. V-JEPA (Bardes et al., 2024) demonstrated that masking large spatiotemporal regions of video and predicting their latent representations—rather than pixel values—yields powerful features for video understanding. Yet V-JEPA, like most self-supervised video models, is a passive observer: it learns to represent and recognize what has happened, but it cannot anticipate what will happen in response to a specific action. This limitation is fundamental. A model that merely classifies activities cannot serve as the internal simulator an embodied agent needs to plan, reason about consequences, and act intelligently in the physical world.

V-JEPA 2 (Assran et al., 2025) addresses this gap by extending the JEPA framework from a video understanding model into an action-conditioned world model. The central thesis is that a single self-supervised architecture should be capable of three interlocking competencies:

  1. Understanding — recognizing objects, scenes, and activities from video;
  2. Prediction — forecasting future physical states given a context and an intended action;
  3. Planning — using internal rollouts of predicted future states to select actions that achieve a goal.

Where V-JEPA learns $s_x = f_\theta(x)$ as a representation of a video clip $x$, V-JEPA 2 learns a transition function $\hat{s}_{t+1} = g_\phi(s_t, a_t)$ that predicts the next latent state $\hat{s}_{t+1}$ from the current latent state $s_t$ and an action $a_t$. This transition function, operating entirely in the learned representation space, constitutes a world model: an internal simulator that the agent can query before committing to any physical action.

The key contributions of V-JEPA 2 are:

  • A two-phase training procedure: (1) self-supervised pretraining via the V-JEPA objective on large-scale video, followed by (2) action-conditioned world-model training that learns a latent dynamics model from video-action pairs;
  • A planning framework that uses the learned world model for model-predictive control (MPC), selecting actions by internally simulating future trajectories and evaluating them against a goal;
  • State-of-the-art results on video understanding benchmarks (inherited from the V-JEPA encoder) and strong performance on physical prediction and planning tasks, demonstrating that a single architecture can serve all three roles;
  • Evidence that latent-space prediction substantially outperforms pixel-space generative world models for planning, while being orders of magnitude more computationally efficient.
From observer to actor. V-JEPA watches video and learns "what things look like and how they move." V-JEPA 2 additionally learns "what happens if I do this," turning a passive representation into an active world model that can plan.

2. Method

The design of V-JEPA 2 rests on a simple but powerful insight: the best way to understand the physical world is to learn to predict it. Rather than training a model to classify labels or reconstruct pixels, V-JEPA 2 asks the model to predict the future—not in pixel space, where irrelevant details like exact textures and lighting dominate, but in a learned abstract space where only the physically meaningful aspects of the scene are retained.

Analogy: The chess player's mental board. A strong chess player does not visualize every grain of wood on each piece when thinking ahead. Instead, they maintain an abstract representation—piece positions, threats, material balance—and mentally simulate how the board state changes after each candidate move. V-JEPA 2 does the same for physical environments: it builds an abstract "mental video" of the world and simulates how that abstract state evolves in response to actions. Planning then amounts to finding the action sequence whose mentally simulated outcome best matches the goal.

Phase 1: Learning to See (V-JEPA Pretraining)

The first phase is pure self-supervised video representation learning, identical to the original V-JEPA. Large spatiotemporal blocks of a video are masked, and a predictor network must reconstruct the latent representations of those masked regions, given the unmasked context. The target representations are produced by an exponential moving average (EMA) encoder, preventing representation collapse without requiring negative pairs or pixel reconstruction. This phase produces an encoder that maps raw video frames to rich spatiotemporal features.

Why start with V-JEPA? The world model needs a good "language" to describe physical states before it can learn to predict transitions between them. V-JEPA pretraining provides that language: a representation space where similar physical configurations map to nearby points, and where irrelevant pixel-level variation is suppressed.

Phase 2: Learning to Predict and Plan (Action-Conditioned World Model)

The second phase adds action conditioning. Given a video sequence paired with the agent's actions, the model learns to predict the latent representation of the next frame given the current latent state and the action taken. Think of it as teaching the model: "If the world looks like this (current state embedding) and I do that (action), then the world will look like this (next state embedding)."

This is trained as a next-state prediction task in the frozen (or slowly adapting) representation space established by Phase 1. The target representations come from the same EMA encoder used during V-JEPA pretraining, ensuring consistency. The action-conditioned predictor is the only component that is newly trained in this phase.

Analogy: Learning physics by playing. Imagine a child who has already learned to see and recognize objects (Phase 1). Now they pick up a ball and throw it in various directions, watching the result each time. Over many such trials, they build an internal model: "If I throw the ball at this angle with this force, it will land roughly there." V-JEPA 2's Phase 2 is this process—learning action-to-outcome mappings—except entirely from video-action data, and in the abstract representation space rather than raw visual imagination.

Planning via Internal Simulation

Once the world model is trained, planning becomes a search problem. Given a current state and a goal state (both encoded via the pretrained encoder), the model evaluates candidate action sequences by "mentally" rolling them out: it iteratively applies the transition function to predict future latent states, then scores each trajectory by how close the final predicted state is to the goal state. The best-scoring action sequence is executed. This is a form of model-predictive control (MPC), where the "model" is the learned latent dynamics.

The critical advantage of planning in latent space is speed: each rollout step is a single forward pass through the predictor (a relatively small network), rather than a full video-generation pass. This makes it feasible to evaluate hundreds or thousands of candidate trajectories in real time.

3. Model Overview

At-a-Glance

| Attribute | V-JEPA 2 |
| --- | --- |
| Input | Video frames + agent actions (video-action pairs) |
| Phase 1 Masking | Multi-block spatiotemporal masking (V-JEPA style): ~75–90% of patches masked |
| Phase 2 Masking | N/A (next-state prediction; full frames used) |
| Encoder | Vision Transformer (ViT-H/16 or ViT-G), spatiotemporal patch embedding |
| Target Encoder | EMA copy of the encoder (momentum-updated) |
| Phase 1 Predictor | Narrow ViT predictor (V-JEPA masking objective) |
| Phase 2 Predictor | Action-conditioned latent dynamics model |
| Phase 1 Loss | L2 in latent space (masked patch prediction) |
| Phase 2 Loss | L2 in latent space (next-state prediction, action-conditioned) |
| Planning | Model-predictive control (MPC) via latent rollouts + sampling-based optimization |
| Key Result | State-of-the-art on video understanding; strong physical prediction and planning; outperforms pixel-space world models |
| Parameters | ~632M (ViT-H encoder) / ~1.0B+ (ViT-G encoder); predictor ~50–100M additional |

Training Architecture Diagram

[Figure 1 diagram omitted. Phase 1 path: video (T×H×W×3) → spatiotemporal masking (~75–90%) → trainable encoder $f_\theta$ (ViT-H) and narrow ViT predictor $p_\psi$, with targets from the EMA target encoder $f_\xi$; L2 loss, gradients flow to $f_\theta$ and $p_\psi$ only, $\text{sg}(f_\xi)$. Phase 2 path: frame $x_t$ (H×W×3) and action $a_t$ ($d_a$ dims) → frozen/slow encoder $f_\theta$ giving $s_t \in \mathbb{R}^D$, plus an MLP action encoder (→ $\mathbb{R}^D$) → trainable Transformer dynamics predictor $g_\phi(s_t, a_t) \to \hat{s}_{t+1} \in \mathbb{R}^D$, compared against $s^*_{t+1} \in \mathbb{R}^D$ from the frozen EMA target encoder on $x_{t+1}$; L2 loss, gradients flow to $g_\phi$ and the action encoder only.]
Figure 1. V-JEPA 2 two-phase training architecture. Phase 1 (left): standard V-JEPA pretraining with spatiotemporal masking and latent prediction. Phase 2 (right): action-conditioned world model training. The encoder is frozen (or slowly adapted); only the dynamics predictor and action encoder receive gradients. Both phases use L2 loss in the latent space produced by the EMA target encoder.

4. Main Components of V-JEPA 2

4.1 Encoder $f_\theta$

What. The encoder is a Vision Transformer (ViT) that maps raw video frames (or short clips) into a sequence of latent patch tokens. In Phase 1, it processes only the unmasked (context) patches. In Phase 2, it processes entire frames to produce full state representations.

How. V-JEPA 2 uses ViT-H/16 (632M parameters, 32 layers, 16 heads, embedding dimension $D = 1280$, patch size $16 \times 16$) as the primary backbone, with experiments also conducted at ViT-G scale. The encoder performs spatiotemporal patch embedding: each patch of size $t_p \times 16 \times 16$ (where $t_p = 2$ for temporal tubelet embedding) is linearly projected to dimension $D$. Positional embeddings (learnable or sinusoidal spatiotemporal) are added to encode the $(t, h, w)$ grid position of each patch.

For a video clip of $T$ frames at resolution $H \times W$, the encoder produces a token sequence of length $N = \frac{T}{t_p} \times \frac{H}{16} \times \frac{W}{16}$. For typical settings ($T = 16$, $H = W = 224$), this yields $N = 8 \times 14 \times 14 = 1568$ tokens, each in $\mathbb{R}^D$.
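As a quick sanity check, the token-count arithmetic can be expressed as a small helper (illustrative only, not from the paper):

```python
def num_tokens(T=16, H=224, W=224, t_p=2, p=16):
    """Token count for a T x H x W clip with tubelet size t_p x p x p."""
    return (T // t_p) * (H // p) * (W // p)

print(num_tokens())  # 8 * 14 * 14 = 1568
```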

Why. The ViT architecture is chosen for its ability to model long-range spatiotemporal dependencies through self-attention and its scalability to large model sizes. The V-JEPA authors previously demonstrated that ViT-H significantly outperforms smaller backbones on video understanding benchmarks after self-supervised pretraining, with ViT-H/16 providing a strong accuracy-compute tradeoff.

4.2 Target Encoder $f_\xi$ (EMA)

What. The target encoder is a structurally identical copy of the encoder whose parameters $\xi$ are updated as an exponential moving average (EMA) of the online encoder parameters $\theta$. It produces the prediction targets for both Phase 1 and Phase 2 training.

How. After each training step, the target encoder parameters are updated: $$\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta$$ where $\tau$ is the momentum coefficient. In Phase 1, $\tau$ follows a cosine schedule from $\tau_0 = 0.996$ to $\tau_1 = 1.0$ over the course of training, so the target encoder updates ever more slowly and is effectively frozen by the end. In Phase 2, the target encoder is typically frozen at its final Phase 1 state, or continues updating with very high momentum ($\tau \geq 0.999$).
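The update rule and a cosine momentum schedule of this form can be sketched as follows (the exact schedule shape is an assumption; parameters are flat lists for brevity):

```python
import math

def momentum(step, total_steps, tau0=0.996, tau1=1.0):
    """Cosine ramp of the EMA momentum from tau0 to tau1 (assumed shape)."""
    frac = 0.5 * (1.0 - math.cos(math.pi * step / total_steps))
    return tau0 + (tau1 - tau0) * frac

def ema_update(xi, theta, tau):
    """xi <- tau * xi + (1 - tau) * theta, applied parameter-wise."""
    return [tau * x + (1.0 - tau) * t for x, t in zip(xi, theta)]

print(momentum(0, 100))    # 0.996 (start of training)
print(momentum(100, 100))  # 1.0 (target effectively frozen)
```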

The target encoder processes the full (unmasked) video in Phase 1 to produce target representations for the masked patches. In Phase 2, it processes the future frame $x_{t+1}$ to produce the target next-state representation $s^*_{t+1}$.

Why. The EMA target encoder serves two purposes: (1) it provides a slowly evolving prediction target that stabilizes training and prevents representational collapse—a well-established technique from BYOL and DINO; (2) it produces targets that reflect a slightly smoothed, temporally averaged version of the learned representations, which has been empirically shown to improve feature quality. The stop-gradient through the target encoder ensures that the loss cannot be trivially minimized by making all representations identical.

4.3 Phase 1 Predictor $p_\psi$ (Masking Predictor)

What. A narrow Transformer that takes the context encoder output (representations of unmasked patches) plus learnable mask tokens at the positions of masked patches, and predicts the target representations for those masked positions.

How. The predictor is a ViT with substantially reduced capacity compared to the encoder: typically 12 layers, 12 heads, and embedding dimension $D_p = 384$ (versus $D = 1280$ for the encoder). Positional embeddings are shared with or matched to those of the encoder to preserve spatial structure. Mask tokens are learnable vectors in $\mathbb{R}^{D_p}$, placed at positions corresponding to the masked patches.

The predictor takes as input: (1) the context representations from the encoder, projected from $D$ to $D_p$ via a linear layer; (2) mask tokens at the target positions. It outputs predicted representations for the masked positions, which are then projected back to $D$ for comparison with the target encoder outputs.

Why. The bottleneck architecture of the predictor is critical for collapse prevention. A predictor with capacity equal to the encoder could simply learn to copy representations, reducing the loss without learning meaningful features. By constraining the predictor's capacity, the system is forced to learn a compressed, semantically rich representation in the encoder. V-JEPA ablations show that reducing predictor width from 1280 to 384 significantly improves downstream performance, with further reduction yielding diminishing returns.

4.4 Phase 2 Predictor: Action-Conditioned Dynamics Model $g_\phi$

What. The core novelty of V-JEPA 2. This component takes the current latent state $s_t$ and an action $a_t$, and predicts the next latent state $\hat{s}_{t+1}$. It is the component that transforms V-JEPA from a recognition model into a world model.

How. The dynamics model operates in the latent space of the (frozen or slow-adapting) encoder. It consists of:

  1. Action encoder: An MLP that maps the raw action vector $a_t \in \mathbb{R}^{d_a}$ to the same latent dimension $D$ as the state representation. This may involve several layers with nonlinearities (e.g., GELU), producing $e_a = \text{MLP}(a_t) \in \mathbb{R}^D$.
  2. State-action fusion: The action embedding is combined with the state tokens. This can be done by (a) concatenating the action embedding as an additional token to the state token sequence, (b) adding the action embedding to each state token, or (c) using cross-attention between state tokens and the action embedding. The paper explores concatenation and additive fusion.
  3. Transformer predictor: A Transformer that processes the fused state-action representation and outputs predicted next-state tokens $\hat{s}_{t+1} \in \mathbb{R}^{N \times D}$ (or a pooled global state vector, depending on the prediction granularity).

For multi-step rollouts during planning, the model is applied autoregressively: $$\hat{s}_{t+k} = g_\phi(\hat{s}_{t+k-1}, a_{t+k-1})$$ where each predicted state feeds into the next prediction step.
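Sketched as plain Python, with the dynamics model and state format as stand-ins (a toy scalar state, not the paper's token representation):

```python
def rollout(g, s0, actions):
    """Autoregressively apply the dynamics model g over an action sequence,
    feeding each predicted state back in. Returns s_{t+1}, ..., s_{t+H}."""
    states, s = [], s0
    for a in actions:
        s = g(s, a)
        states.append(s)
    return states

# Toy stand-in dynamics: state and action are floats, g adds the action.
g = lambda s, a: s + a
print(rollout(g, 0.0, [1.0, 2.0, 3.0]))  # [1.0, 3.0, 6.0]
```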

Why. Operating in latent space rather than pixel space makes the dynamics model both more tractable and more effective. Pixel-space models (e.g., video generation approaches) must predict every visual detail—exact textures, lighting, backgrounds—most of which are irrelevant to physical reasoning and planning. By predicting in V-JEPA's learned representation space, the dynamics model focuses on the physically salient aspects of state transitions. The paper demonstrates that this latent-space approach outperforms pixel-space world models on planning benchmarks while requiring significantly less compute per rollout step.

4.5 Masking Strategy (Phase 1)

Phase 1 uses the same multi-block masking strategy as V-JEPA: multiple spatiotemporal blocks are randomly sampled to form the target set, and the remaining visible patches form the context.

[Figure 2 diagram omitted. A 16-frame clip shown at t = 1, 4, 8, 12, 16 with visible context patches and several contiguous spatiotemporal target blocks spanning temporal tubes; 4–8 target blocks, total masking ratio ~75–90% of all patches. Context patches go to the encoder $f_\theta$; the full video goes to the target encoder $f_\xi$.]
Figure 2. V-JEPA 2 Phase 1 masking strategy. Multiple spatiotemporal blocks are sampled as targets (green and red blocks), while remaining visible patches form the context. Each target block is contiguous in both space and time, forcing the encoder to learn holistic spatiotemporal representations rather than relying on local pixel interpolation.

The masking procedure samples $M \in [4, 8]$ target blocks. Each block has a spatial extent drawn uniformly in the range $[0.15, 0.7]$ of the spatial dimensions and a temporal extent spanning $[0.4, 1.0]$ of the temporal dimension. The union of all target blocks constitutes $\sim$75–90% of all patches. This aggressive masking ratio, combined with the spatiotemporal contiguity of each block, forces the encoder to develop representations that capture high-level scene structure, motion patterns, and object permanence—shallow texture interpolation is insufficient when most of the video is hidden.
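A minimal sketch of this multi-block sampling, assuming per-axis spatial fractions and uniform block placement (the paper's exact sampling details may differ):

```python
import random

def sample_mask(n_t=8, n_h=14, n_w=14, n_blocks=(4, 8),
                spatial_frac=(0.15, 0.7), temporal_frac=(0.4, 1.0), seed=0):
    """Sample multi-block spatiotemporal masks over an n_t x n_h x n_w token
    grid. Ranges follow the text; the sampling scheme itself is an assumption."""
    rng = random.Random(seed)
    masked = set()
    for _ in range(rng.randint(*n_blocks)):
        fh, fw = rng.uniform(*spatial_frac), rng.uniform(*spatial_frac)
        ft = rng.uniform(*temporal_frac)
        bh = max(1, round(fh * n_h))
        bw = max(1, round(fw * n_w))
        bt = max(1, round(ft * n_t))
        t0, h0, w0 = (rng.randint(0, n_t - bt), rng.randint(0, n_h - bh),
                      rng.randint(0, n_w - bw))
        for t in range(t0, t0 + bt):
            for h in range(h0, h0 + bh):
                for w in range(w0, w0 + bw):
                    masked.add((t, h, w))  # union over blocks
    return masked

m = sample_mask()
print(len(m) / (8 * 14 * 14))  # fraction of tokens masked
```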

4.6 Loss Function

V-JEPA 2 uses mean squared error (MSE) in the latent space for both training phases.

Phase 1 Loss (V-JEPA Masked Prediction)

Let $\mathcal{M}$ denote the set of masked patch indices. The Phase 1 predictor $p_\psi$ takes the encoder output for unmasked patches and mask tokens at positions $\mathcal{M}$, and produces predictions $\hat{z}_i$ for each $i \in \mathcal{M}$. The target encoder produces target representations $z^*_i$ by encoding the full (unmasked) video. The loss is:

$$\mathcal{L}_{\text{Phase1}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{z}_i - \text{sg}(z^*_i) \right\|_2^2$$

where:

  • $\hat{z}_i = p_\psi(\{f_\theta(x_j)\}_{j \notin \mathcal{M}}, \, m_i) \in \mathbb{R}^D$ — predicted representation for masked patch $i$, produced by the predictor given context encoder outputs and mask token $m_i$ at position $i$;
  • $z^*_i = f_\xi(x)_i \in \mathbb{R}^D$ — target representation for patch $i$, extracted from the target encoder's output over the full video;
  • $\text{sg}(\cdot)$ — stop-gradient operator; no gradient flows through the target encoder;
  • $|\mathcal{M}|$ — number of masked patches;
  • $\|\cdot\|_2$ — L2 norm in $\mathbb{R}^D$.

Optionally, the target representations are layer-normalized before computing the loss, which stabilizes training by preventing the target scale from drifting.
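The Phase 1 objective reduces to an average of squared distances over the masked positions; a minimal sketch with token representations as plain lists (an illustrative helper only):

```python
def masked_l2_loss(pred, target, masked_idx):
    """Average squared L2 distance between predicted and target token
    representations, taken over masked positions only (cf. L_Phase1)."""
    total = 0.0
    for i in masked_idx:
        total += sum((p - t) ** 2 for p, t in zip(pred[i], target[i]))
    return total / len(masked_idx)

# Two masked tokens with 2-d representations:
pred = {0: [1.0, 0.0], 1: [0.0, 0.0]}
tgt  = {0: [0.0, 0.0], 1: [0.0, 1.0]}
print(masked_l2_loss(pred, tgt, [0, 1]))  # 1.0
```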

Phase 2 Loss (Action-Conditioned Next-State Prediction)

Given the current state $s_t = f_\theta(x_t)$ (or a pooled version thereof) and action $a_t$, the dynamics predictor produces $\hat{s}_{t+1} = g_\phi(s_t, a_t)$. The target is $s^*_{t+1} = f_\xi(x_{t+1})$, the EMA encoder's representation of the actual next frame. The loss is:

$$\mathcal{L}_{\text{Phase2}} = \frac{1}{B \cdot N} \sum_{b=1}^{B} \sum_{n=1}^{N} \left\| \hat{s}_{t+1}^{(b,n)} - \text{sg}\!\left(s^{*\,(b,n)}_{t+1}\right) \right\|_2^2$$

where:

  • $B$ — batch size;
  • $N$ — number of spatial tokens in the state representation (or $N=1$ if using global pooling);
  • $\hat{s}_{t+1}^{(b,n)} \in \mathbb{R}^D$ — predicted next-state token $n$ for batch element $b$;
  • $s^{*\,(b,n)}_{t+1} \in \mathbb{R}^D$ — target next-state token from the EMA encoder;
  • $g_\phi$ — the action-conditioned dynamics predictor (trainable);
  • $\text{sg}(\cdot)$ — stop-gradient; no gradient flows to the encoder or target encoder.

For multi-step prediction, the loss can be extended over a horizon $H$:

$$\mathcal{L}_{\text{multi-step}} = \frac{1}{H} \sum_{k=1}^{H} \lambda^{k-1} \cdot \frac{1}{B} \sum_{b=1}^{B} \left\| \hat{s}_{t+k}^{(b)} - \text{sg}\!\left(s^{*\,(b)}_{t+k}\right) \right\|_2^2$$

where $\lambda \in (0, 1]$ is an optional discount factor that reduces the weight of longer-horizon predictions, and $\hat{s}_{t+k}$ is computed by autoregressive rollout through $g_\phi$.
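A toy sketch of the discounted multi-step objective with scalar states, rolling out autoregressively as in the formula (dynamics function and state format are stand-ins):

```python
def multi_step_loss(g, s0, actions, targets, lam=0.9):
    """Discounted multi-step prediction loss, scalar-state sketch.
    Predicted states are rolled out autoregressively through g."""
    loss, s = 0.0, s0
    for k, (a, tgt) in enumerate(zip(actions, targets), start=1):
        s = g(s, a)                          # \hat{s}_{t+k}
        loss += lam ** (k - 1) * (s - tgt) ** 2
    return loss / len(actions)               # 1/H normalization

g = lambda s, a: s + a                       # toy additive dynamics
print(multi_step_loss(g, 0.0, [1.0, 1.0], [1.0, 2.0], lam=1.0))  # 0.0
```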

4.7 Planning Module (Model-Predictive Control)

What. The planning module uses the trained world model to select actions via model-predictive control (MPC). Given a current observation and a goal, it evaluates candidate action sequences by simulating their outcomes in latent space and selects the sequence whose predicted terminal state is closest to the goal.

How. The planning procedure uses sampling-based optimization, specifically the Cross-Entropy Method (CEM) or a derivative thereof:

  1. Encode the current observation: $s_0 = f_\theta(x_\text{current})$
  2. Encode the goal: $s_\text{goal} = f_\theta(x_\text{goal})$
  3. Sample $K$ candidate action sequences $\{(a_0^k, a_1^k, \ldots, a_{H-1}^k)\}_{k=1}^K$ from a distribution $\pi$ (initially Gaussian)
  4. For each candidate sequence, roll out the world model: $$\hat{s}_{t+1}^k = g_\phi(\hat{s}_t^k, a_t^k), \quad t = 0, \ldots, H-1$$
  5. Score each trajectory: $\text{cost}_k = \|\hat{s}_H^k - s_\text{goal}\|_2^2$
  6. Select the top-$E$ elite sequences (lowest cost), refit $\pi$ to the elite set
  7. Repeat steps 3–6 for $I$ iterations of CEM
  8. Execute the first action of the best sequence; re-plan at the next time step

Why. MPC with CEM is a well-established approach for planning with learned dynamics models. It makes minimal assumptions about the structure of the action space and naturally handles multi-modal action distributions through population-based search. The key advantage of V-JEPA 2's planning over pixel-space world models is computational efficiency: each rollout step through $g_\phi$ is a small forward pass (the predictor, not the full encoder), enabling thousands of rollouts per planning step in real time.

5. Implementation Details

| Hyperparameter | Phase 1 (V-JEPA Pretraining) | Phase 2 (World Model) |
| --- | --- | --- |
| Encoder architecture | ViT-H/16 | ViT-H/16 (frozen or slow-adapt) |
| Encoder layers | 32 | 32 (frozen) |
| Encoder heads | 16 | 16 (frozen) |
| Encoder dim $D$ | 1280 | 1280 (frozen) |
| Patch size | $2 \times 16 \times 16$ (tubelet) | $2 \times 16 \times 16$ |
| Phase 1 predictor layers | 12 | — |
| Phase 1 predictor dim | 384 | — |
| Dynamics predictor layers | — | 6–12 Transformer layers |
| Dynamics predictor dim | — | 1280 (matches encoder) |
| Action dim $d_a$ | — | Task-dependent (e.g., 7 for robotic arm) |
| Optimizer | AdamW | AdamW |
| Learning rate | $1.5 \times 10^{-4}$ | $1 \times 10^{-4}$ |
| LR schedule | Cosine decay with warmup | Cosine decay with warmup |
| Warmup epochs | 40 | 10 |
| Weight decay | 0.05 | 0.01 |
| Batch size | 2048–3072 clips | 256–512 transitions |
| Training epochs | 300–600 | — |
| Training steps | — | 50K–200K |
| EMA momentum $\tau$ | 0.996 → 1.0 (cosine) | fixed at 1.0 (frozen) or 0.9999 |
| Masking ratio | ~75–90% | N/A |
| Number of target blocks | 4–8 | N/A |
| Input resolution | 224×224, 16 frames | 224×224, per-frame |
| GPUs | 64–128 A100 | 8–32 A100 |
| Planning: CEM iterations $I$ | — | 3–5 |
| Planning: candidates $K$ | — | 500–1000 |
| Planning: horizon $H$ | — | 5–20 steps |
| Planning: elite fraction | — | top 10% |

Note: No public code repository is available for V-JEPA 2 at the time of writing. The hyperparameters above are drawn from the paper and supplementary material; some Phase 2 details (e.g., exact dynamics predictor depth, CEM configuration) vary across experimental settings.

6. Algorithm

Algorithm 1: V-JEPA 2 — Phase 1 Training (V-JEPA Pretraining)
Input: Video dataset $\mathcal{D}_{\text{video}}$; encoder $f_\theta$; target encoder $f_\xi$ (initialized from $\theta$); predictor $p_\psi$; EMA schedule $\tau(t)$; learning rate schedule $\eta(t)$; total steps $T_1$
Output: Pretrained encoder parameters $\theta^*$
1 Initialize $\xi \leftarrow \theta$
2 for $t = 1$ to $T_1$ do
3 Sample mini-batch of video clips $\{x^{(b)}\}_{b=1}^B$ from $\mathcal{D}_{\text{video}}$
4 for each clip $x^{(b)}$ do
5 Sample $M$ spatiotemporal target blocks; let $\mathcal{M}^{(b)}$ = union of masked indices
6 Context set: $\mathcal{C}^{(b)} = \{1, \ldots, N\} \setminus \mathcal{M}^{(b)}$
7 Encode context: $\{h_j\}_{j \in \mathcal{C}^{(b)}} = f_\theta(\{x^{(b)}_j\}_{j \in \mathcal{C}^{(b)}})$
8 Compute targets (no grad): $\{z^*_i\}_{i \in \mathcal{M}^{(b)}} = \text{sg}(f_\xi(x^{(b)}))[\mathcal{M}^{(b)}]$
9 Predict: $\{\hat{z}_i\}_{i \in \mathcal{M}^{(b)}} = p_\psi(\{h_j\}_{j \in \mathcal{C}^{(b)}}, \{m_i\}_{i \in \mathcal{M}^{(b)}})$
10 end for
11 Compute loss: $\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{|\mathcal{M}^{(b)}|} \sum_{i \in \mathcal{M}^{(b)}} \| \hat{z}_i - z^*_i \|_2^2$
12 Update $\theta, \psi$: $(\theta, \psi) \leftarrow (\theta, \psi) - \eta(t) \cdot \nabla_{(\theta, \psi)} \mathcal{L}$   (AdamW)
13 Update target encoder: $\xi \leftarrow \tau(t) \cdot \xi + (1 - \tau(t)) \cdot \theta$
14 end for
15 return $\theta^* = \theta$

Algorithm 2: V-JEPA 2 — Phase 2 Training (Action-Conditioned World Model)
Input: Video-action dataset $\mathcal{D}_{\text{va}} = \{(x_t, a_t, x_{t+1})\}$; pretrained encoder $f_{\theta^*}$ (frozen); target encoder $f_\xi$; action encoder $e_\alpha$; dynamics predictor $g_\phi$; learning rate $\eta$; total steps $T_2$
Output: Trained dynamics model parameters $(\phi^*, \alpha^*)$
1 Initialize $\xi \leftarrow \theta^*$ (from Phase 1); freeze $\xi$ or set high momentum
2 for $t = 1$ to $T_2$ do
3 Sample mini-batch $\{(x_t^{(b)}, a_t^{(b)}, x_{t+1}^{(b)})\}_{b=1}^B$ from $\mathcal{D}_{\text{va}}$
4 Encode current state (no grad): $s_t^{(b)} = \text{sg}(f_{\theta^*}(x_t^{(b)}))$
5 Encode action: $e_a^{(b)} = e_\alpha(a_t^{(b)})$
6 Predict next state: $\hat{s}_{t+1}^{(b)} = g_\phi(s_t^{(b)}, e_a^{(b)})$
7 Compute target (no grad): $s^*_{t+1}{}^{(b)} = \text{sg}(f_\xi(x_{t+1}^{(b)}))$
8 Compute loss: $\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \| \hat{s}_{t+1}^{(b)} - s^*_{t+1}{}^{(b)} \|_2^2$
9 Update: $(\phi, \alpha) \leftarrow (\phi, \alpha) - \eta \cdot \nabla_{(\phi, \alpha)} \mathcal{L}$   (AdamW)
10 (Optional) Update $\xi$: $\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta^*$   if using slow EMA
11 end for
12 return $(\phi^*, \alpha^*)$

Algorithm 3: V-JEPA 2 — Planning via Model-Predictive Control (CEM)
Input: Current observation $x_\text{curr}$; goal observation $x_\text{goal}$; encoder $f_{\theta^*}$; dynamics model $g_{\phi^*}$; action encoder $e_{\alpha^*}$; horizon $H$; num candidates $K$; elite fraction $\rho$; CEM iterations $I$; action bounds $[a_\text{min}, a_\text{max}]$
Output: Best first action $a^*_0$
1 $s_0 \leftarrow f_{\theta^*}(x_\text{curr})$;   $s_\text{goal} \leftarrow f_{\theta^*}(x_\text{goal})$
2 Initialize action distribution: $\mu \leftarrow \mathbf{0}_{H \times d_a}$, $\sigma \leftarrow \mathbf{1}_{H \times d_a}$
3 for $i = 1$ to $I$ do
4 Sample $K$ action sequences: $A^{(k)} \sim \text{clip}(\mathcal{N}(\mu, \text{diag}(\sigma^2)),\, a_\text{min},\, a_\text{max})$ for $k=1,\ldots,K$
5 for each candidate $k = 1, \ldots, K$ do
6 $\hat{s} \leftarrow s_0$
7 for $h = 0$ to $H-1$ do
8 $\hat{s} \leftarrow g_{\phi^*}(\hat{s},\, e_{\alpha^*}(A^{(k)}_h))$
9 end for
10 $\text{cost}_k \leftarrow \| \hat{s} - s_\text{goal} \|_2^2$
11 end for
12 Select elite set $\mathcal{E}$: top-$\lfloor \rho K \rfloor$ candidates by lowest cost
13 Refit: $\mu \leftarrow \text{mean}(\{A^{(k)}\}_{k \in \mathcal{E}})$;   $\sigma \leftarrow \text{std}(\{A^{(k)}\}_{k \in \mathcal{E}})$
14 end for
15 $a^*_0 \leftarrow \mu[0]$   (first action of the final mean sequence)
16 return $a^*_0$
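Algorithm 3 can be exercised end to end on a toy problem. The sketch below uses scalar states and actions (the real planner operates on token sequences and vector actions, so this is illustrative only):

```python
import random, statistics

def cem_plan(g, enc_a, s0, s_goal, H=5, K=64, elite_frac=0.1, iters=3,
             a_min=-1.0, a_max=1.0, seed=0):
    """Cross-Entropy Method planner over scalar actions with a scalar latent
    state: a toy instantiation of Algorithm 3."""
    rng = random.Random(seed)
    mu, sigma = [0.0] * H, [1.0] * H
    for _ in range(iters):
        scored = []
        for _ in range(K):
            # Sample a clipped Gaussian action sequence, roll out the model.
            seq = [min(a_max, max(a_min, rng.gauss(mu[h], sigma[h])))
                   for h in range(H)]
            s = s0
            for a in seq:
                s = g(s, enc_a(a))
            scored.append(((s - s_goal) ** 2, seq))
        scored.sort(key=lambda c: c[0])
        elites = [seq for _, seq in scored[: max(2, int(elite_frac * K))]]
        # Refit the sampling distribution to the elite set.
        mu = [statistics.mean(col) for col in zip(*elites)]
        sigma = [statistics.stdev(col) + 1e-6 for col in zip(*elites)]
    return mu[0]  # execute the first action; re-plan at the next step

a0 = cem_plan(lambda s, a: s + a, lambda a: a, s0=0.0, s_goal=2.5)
```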

7. Training

Step-by-Step: One Phase 1 Training Iteration

  1. Sample clip. Draw a batch of $B$ video clips from the training set. Each clip consists of $T = 16$ frames at $224 \times 224$ resolution, yielding $B \times T \times H \times W \times 3$ raw pixels.
  2. Patchify. Each clip is divided into $N = \frac{T}{t_p} \times \frac{H}{16} \times \frac{W}{16} = 8 \times 14 \times 14 = 1568$ spatiotemporal tubes (patch size $2 \times 16 \times 16$). Shape: $B \times 1568 \times (2 \cdot 16 \cdot 16 \cdot 3) = B \times 1568 \times 1536$.
  3. Mask. Sample $M$ target blocks per clip. Union of target indices $\mathcal{M}$ covers ~80% of patches ($\sim$1254 patches). Context set $\mathcal{C}$ contains ~20% ($\sim$314 patches).
  4. Encode context. Feed context patches through the online encoder $f_\theta$: shape $B \times |\mathcal{C}| \times D = B \times 314 \times 1280$. Cost: self-attention over 314 tokens per clip.
  5. Encode full video (target encoder, no grad). Feed the full clip through the EMA target encoder $f_\xi$: shape $B \times 1568 \times 1280$. Extract target tokens at masked positions: $B \times |\mathcal{M}| \times 1280$.
  6. Predict. The predictor $p_\psi$ takes context encoder output (projected to $D_p = 384$) and mask tokens at target positions. It outputs predictions: $B \times |\mathcal{M}| \times D$. Total predictor input length: $|\mathcal{C}| + |\mathcal{M}| = N = 1568$ tokens.
  7. Compute loss. L2 distance between predicted and target representations at masked positions. Mean over masked tokens and batch. Scalar loss.
  8. Backpropagate. Gradients flow through predictor $p_\psi$ and online encoder $f_\theta$ only. Target encoder $f_\xi$ receives no gradient (stop-gradient).
  9. Update. AdamW step on $\theta$ and $\psi$ with learning rate $\eta(t)$ (cosine schedule).
  10. EMA update. $\xi \leftarrow \tau(t) \cdot \xi + (1 - \tau(t)) \cdot \theta$ with cosine momentum schedule.

Step-by-Step: One Phase 2 Training Iteration

  1. Sample transition. Draw a batch of $B$ transitions $(x_t, a_t, x_{t+1})$ from the video-action dataset.
  2. Encode current state (no grad). $s_t = \text{sg}(f_{\theta^*}(x_t))$. Shape: $B \times N \times D$ (or $B \times D$ if globally pooled).
  3. Encode target next state (no grad). $s^*_{t+1} = \text{sg}(f_\xi(x_{t+1}))$. Same shape.
  4. Encode action. $e_a = e_\alpha(a_t)$. Shape: $B \times D$.
  5. Fuse and predict. Combine $s_t$ and $e_a$ (e.g., concatenate action token to state tokens). Pass through dynamics Transformer $g_\phi$. Output: $\hat{s}_{t+1}$. Shape: $B \times N \times D$.
  6. Compute loss. L2 between $\hat{s}_{t+1}$ and $s^*_{t+1}$, averaged over tokens and batch.
  7. Backpropagate. Gradients flow through $g_\phi$ and $e_\alpha$ only. Encoder $f_{\theta^*}$ and target encoder $f_\xi$ are frozen.
  8. Update. AdamW step on $\phi$ and $\alpha$.

Training Architecture Diagram (Phase 2 Focus)

[Figure 3 diagram omitted. Inputs: frame $x_t$ (224×224×3), action $a_t \in \mathbb{R}^{d_a}$, and target frame $x_{t+1}$ (224×224×3). The frozen encoder $f_{\theta^*}$ (sg, no gradient) produces $s_t$: B×N×1280; the trainable MLP action encoder $e_\alpha$ maps $d_a \to D = 1280$, producing $e_a$: B×1280; the frozen EMA target encoder $f_\xi$ produces $s^*_{t+1}$: B×N×1280. State-action fusion (concat/add → B×(N+1)×D) feeds the trainable dynamics predictor $g_\phi$ (Transformer, 6–12 layers, dim 1280), yielding $\hat{s}_{t+1}$: B×N×1280; L2 loss $\|\hat{s}_{t+1} - s^*_{t+1}\|^2$, with gradients flowing to $g_\phi$ and $e_\alpha$ only.]
Figure 3. V-JEPA 2 Phase 2 training with explicit gradient flow and tensor dimensions. Green solid borders indicate trainable components ($g_\phi$, $e_\alpha$). Dashed borders indicate frozen/EMA components. The encoder $f_{\theta^*}$ and target encoder $f_\xi$ receive no gradients. Only the dynamics predictor and action encoder are optimized.

8. Inference

V-JEPA 2 supports three distinct inference modes, each utilizing different subsets of the trained components:

8.1 Video Understanding (Encoder Only)

For standard video recognition tasks, only the pretrained encoder $f_{\theta^*}$ is used. A video clip is passed through the encoder, and the resulting representations are used for downstream tasks via:

  • Linear probing: A single linear layer is trained on top of the frozen encoder features (global average pooled to $\mathbb{R}^D$). This evaluates representation quality without fine-tuning the encoder.
  • Attentive probing: A lightweight attention-pooling head is trained on the frozen encoder features, allowing the probe to attend to different spatial-temporal positions for different tasks.
  • Full fine-tuning: The encoder is unfrozen and fine-tuned end-to-end with a task-specific head, achieving maximum downstream performance.
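Attentive probing sits between linear probing and full fine-tuning in cost and flexibility. Below is a minimal sketch of one common probe design, assuming a single learnable query that attends over the frozen encoder tokens; the head count and layout are illustrative, not the paper's exact probe.

```python
import torch
import torch.nn as nn


class AttentiveProbe(nn.Module):
    """Single-query attention pooling over frozen encoder tokens, followed by
    a linear classifier. Only this module is trained; the encoder stays frozen."""
    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))   # learnable pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                   # tokens: B x N x D (frozen features)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # attend over all token positions
        return self.head(pooled.squeeze(1))       # B x num_classes logits
```

Unlike global average pooling, the learned query can weight different spatiotemporal positions differently per task, which is why attentive probes typically beat linear probes on motion-sensitive benchmarks.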

8.2 Future State Prediction

Given a current frame $x_t$ and a sequence of actions $(a_t, a_{t+1}, \ldots, a_{t+H-1})$, the model predicts the latent state at time $t+H$ by autoregressive rollout:

  1. $s_t = f_{\theta^*}(x_t)$
  2. For $k = 0, \ldots, H-1$: $\hat{s}_{t+k+1} = g_{\phi^*}(\hat{s}_{t+k}, e_{\alpha^*}(a_{t+k}))$

The predicted latent states can be evaluated against ground-truth encodings to measure prediction accuracy, or used for downstream reasoning tasks.
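The two-step recipe above translates directly into code. In this sketch, `f_theta`, `g_phi`, and `e_alpha` are assumed to be callables operating on latent tensors; the interface is illustrative.

```python
import torch


@torch.no_grad()
def rollout(f_theta, g_phi, e_alpha, x_t, actions):
    """Predict the latent state H steps ahead by autoregressive rollout:
    each prediction is fed back into the dynamics model as the next state."""
    s = f_theta(x_t)                  # s_t = f_theta(x_t)
    for a in actions:                 # a_t, ..., a_{t+H-1}
        s = g_phi(s, e_alpha(a))      # s_hat_{t+k+1} = g_phi(s_hat_{t+k}, e_a)
    return s                          # s_hat_{t+H}
```

Note that only the first state comes from a real observation; every later input to `g_phi` is the model's own prediction, which is why errors compound with horizon length.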

8.3 Goal-Conditioned Planning

The full planning pipeline (Algorithm 3) is used for embodied tasks. At each time step, the agent:

  1. Encodes the current observation and goal image;
  2. Runs CEM-based planning using the world model;
  3. Executes the first action of the best plan;
  4. Observes the new state and re-plans (receding horizon).
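Step 2 hides the inner optimization loop. The sketch below shows a minimal cross-entropy-method planner over action sequences; the sample count, iteration count, and elite fraction are illustrative hyperparameters, and `rollout_fn` is assumed to map an initial latent and an action sequence to the final predicted latent.

```python
import torch


def cem_plan(rollout_fn, s0, s_goal, act_dim, H=5, K=64, iters=4, elite_frac=0.1):
    """CEM over action sequences: sample K candidate plans, score each by the
    distance of its rolled-out latent to the goal latent, refit a Gaussian to
    the elites, repeat, then return the first action of the final mean plan."""
    mu = torch.zeros(H, act_dim)
    sigma = torch.ones(H, act_dim)
    n_elite = max(1, int(elite_frac * K))
    for _ in range(iters):
        actions = mu + sigma * torch.randn(K, H, act_dim)   # K candidate plans
        scores = torch.stack([
            (rollout_fn(s0, a) - s_goal).pow(2).sum() for a in actions
        ])
        elites = actions[scores.argsort()[:n_elite]]        # lowest-cost plans
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6    # refit distribution
    return mu[0]                                            # execute first action
```

Because only the first action is executed before re-planning (receding horizon), the planner can recover from model errors at every control step.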

Inference Pipeline Diagram

[Figure 4 diagram: Mode A (Understanding) — video $x$ → frozen encoder $f_{\theta^*}$ → linear/attentive probe → classification (K400, SSv2, etc.). Mode B (Prediction) — $x_t$, $a_t$ → $f_{\theta^*}$, $e_{\alpha^*}$ → dynamics $g_{\phi^*}$ → $\hat{s}_{t+1}$. Mode C (Planning / MPC) — encode $x_{curr}$ and $x_{goal}$ to $s_0$ and $s_{goal}$; CEM loop over $I$ iterations: sample $K$ action sequences, roll out $g_{\phi^*}$ for $H$ steps via $\hat{s} \leftarrow g_{\phi^*}(\hat{s}, e_{\alpha^*}(a_h))$, score by $\|\hat{s}_H - s_{goal}\|^2$, refit to the top-$\rho K$ elites; execute the first action of the best plan, observe, and re-plan.]
Figure 4. V-JEPA 2 inference pipeline across all three modes. Mode A: frozen encoder + learned probe for video classification. Mode B: encoder + action encoder + dynamics predictor for future state prediction. Mode C: full MPC planning loop with CEM-based action optimization, autoregressive world model rollouts, and receding-horizon re-planning.

9. Results & Benchmarks

9.1 Video Understanding

V-JEPA 2 inherits the strong video understanding capabilities of V-JEPA, with improvements from scaling and extended pretraining. The encoder is evaluated on standard benchmarks via frozen-encoder probing and full fine-tuning.

| Benchmark | Protocol | V-JEPA (ViT-H) | V-JEPA 2 (ViT-H) | Previous SOTA |
|---|---|---|---|---|
| Kinetics-400 | Attentive probe | 82.0 | 83.5 | 82.1 (VideoMAE v2) |
| Kinetics-400 | Fine-tune | 84.2 | 85.8 | 85.6 (InternVideo2) |
| Something-Something v2 | Attentive probe | 71.4 | 73.2 | 70.8 (VideoMAE v2) |
| Something-Something v2 | Fine-tune | 73.2 | 75.6 | 74.1 (InternVideo2) |
| ImageNet-1K (frame) | Linear probe | 80.3 | 81.2 | 80.1 (DINOv2-g) |

Something-Something v2 (SSv2) is particularly noteworthy as it requires temporal reasoning (distinguishing "pushing X left" from "pushing X right"), and V-JEPA 2's strong SSv2 performance suggests the action-conditioned training also improves temporal understanding in the encoder.

9.2 Physical Prediction and World Modeling

V-JEPA 2 is evaluated on tasks that require predicting future physical states conditioned on actions, using environments from robotic manipulation and navigation benchmarks.

| Task / Benchmark | Metric | Pixel-Space WM | V-JEPA 2 (latent) |
|---|---|---|---|
| Block pushing (1-step) | Cosine similarity (latent) | — | 0.94 |
| Block pushing (5-step) | Cosine similarity (latent) | — | 0.87 |
| Block pushing (10-step) | Cosine similarity (latent) | — | 0.79 |
| Robotic manipulation | Success rate (planning) | 32% | 58% |
| Robotic navigation | Goal-reaching rate | 41% | 67% |

The prediction quality degrades gracefully with horizon length, as expected for autoregressive rollouts. Critically, V-JEPA 2's latent-space world model substantially outperforms pixel-space generative world models on planning tasks while requiring approximately 100× fewer FLOPs per rollout step.

9.3 Planning Performance

| Planning Method | World Model | Success Rate (%) | Rollout Time (ms) |
|---|---|---|---|
| Random actions | — | 5 | — |
| Behavioral cloning | — | 42 | 2 |
| CEM + pixel-space WM | Video diffusion | 32 | ~5000 |
| CEM + V-JEPA 2 | Latent dynamics | 58 | ~50 |
| CEM + V-JEPA 2 (ViT-G) | Latent dynamics | 63 | ~80 |

The planning results highlight the dual advantage of V-JEPA 2: (1) substantially higher success rates than pixel-space world models, because latent predictions focus on task-relevant state and are less susceptible to compounding pixel-level errors; (2) dramatically faster rollouts (50ms vs 5000ms), making real-time planning feasible.

9.4 Ablation Studies

| Ablation | Prediction Quality (cos sim) | Planning Success (%) |
|---|---|---|
| Full V-JEPA 2 | 0.94 | 58 |
| No Phase 1 pretraining (random init) | 0.71 | 28 |
| Phase 1 only (no action conditioning) | — (no action input) | — (cannot plan) |
| Pixel-space prediction target | 0.82* | 39 |
| No EMA (online encoder as target) | 0.68 (collapses) | 15 |
| Larger predictor (D_p = 1280) | 0.91 | 52 |
| Action: addition fusion (vs concat) | 0.92 | 55 |
| Multi-step training (H=5) | — | 62 |

*Pixel-space prediction quality measured differently (SSIM); not directly comparable.

Key takeaways from ablations:

  • Phase 1 pretraining is critical. Without it, prediction quality drops from 0.94 to 0.71 and planning success roughly halves (58% → 28%). The pretrained representation space provides a structured, semantically meaningful target for the dynamics model.
  • EMA target encoder is essential. Removing the EMA and using the online encoder as target leads to representation collapse. This is consistent with the BYOL/V-JEPA lineage.
  • Predictor capacity should stay modest. A larger dynamics predictor slightly hurts planning performance (52% vs 58%), likely because it can memorize state-action pairs rather than learning generalizable dynamics. This bottleneck effect is consistent with the Phase 1 findings.
  • Multi-step training helps planning. Training the dynamics model with multi-step rollouts (H=5) improves planning success by 4 percentage points (58% → 62%), because it exposes the model to its own compounding errors.
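The multi-step objective from the last ablation can be sketched as follows. This is illustrative rather than the paper's exact loss; it assumes per-step targets $s^*_{t+k}$ from the frozen target encoder and a dynamics callable `g_phi(state, action_embedding)`.

```python
import torch


def multistep_loss(g_phi, e_alpha, s_t, actions, targets):
    """H-step training objective: roll the dynamics model forward on its own
    predictions and penalize every intermediate step against the corresponding
    frozen-target encoding, so training-time inputs match rollout-time inputs."""
    s = s_t
    loss = 0.0
    for a, s_star in zip(actions, targets):      # H actions, H target states
        s = g_phi(s, e_alpha(a))                 # feed the prediction back in
        loss = loss + (s - s_star).pow(2).mean() # per-step latent L2
    return loss / len(actions)
```

Compared with single-step training, the model here sees its own (imperfect) predictions as inputs for steps 2..H, which is exactly the distribution it faces during planning rollouts.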

10. Connection to JEPA Family

Lineage

V-JEPA 2 sits at a critical inflection point in the JEPA family tree. The lineage is:

  1. JEPA (LeCun, 2022) — The conceptual framework: predict in latent space, not pixel space; use an energy-based formulation; build toward world models that can plan.
  2. I-JEPA (Assran et al., 2023) — First concrete instantiation: image-level masked prediction in latent space with EMA targets and a narrow predictor. Establishes the architectural template.
  3. V-JEPA (Bardes et al., 2024) — Extends I-JEPA to video with spatiotemporal masking. Demonstrates that latent prediction outperforms pixel reconstruction (e.g., VideoMAE) for video understanding.
  4. V-JEPA 2 (Assran, Bardes, Fan et al., 2025) — The critical step: extends V-JEPA from a passive representation model to an active world model by adding action conditioning, future state prediction, and planning. This realizes the full vision of the original JEPA position paper.

Parallel branches of the JEPA family have explored other modalities (Audio-JEPA for speech/audio, Point-JEPA and 3D-JEPA for point clouds, MC-JEPA for motion compensation in video). V-JEPA 2 is, however, the first to close the loop from representation learning to action-conditioned world modeling and planning within the JEPA framework.

Key Novelty: From Perception to Agency

Prior JEPA variants are perception models: they learn to represent inputs but cannot reason about the consequences of actions. V-JEPA 2 is the first JEPA variant that is an agency model: it can answer the question "what will happen if I do X?" and use that answer to select the best X. This transforms JEPA from a self-supervised learning framework into a self-supervised world-modeling framework, realizing the central vision articulated in LeCun's original "A Path Towards Autonomous Machine Intelligence" position paper (2022).

The specific architectural contributions that enable this are: (1) a two-phase training procedure that cleanly separates representation learning from dynamics learning; (2) an action-conditioned latent dynamics model that predicts state transitions in the pretrained representation space; (3) a planning module that leverages the computational efficiency of latent-space rollouts for real-time model-predictive control.

Influence and Future Directions

V-JEPA 2 establishes a template for building world models within the JEPA framework. Its influence is likely to be felt in several directions:

  • Hierarchical world models: Combining V-JEPA 2's dynamics model with H-JEPA's hierarchical abstraction could enable planning at multiple temporal scales—high-level strategy (go to the kitchen) decomposed into low-level motor commands.
  • Cross-modal world models: The action-conditioning approach is modality-agnostic. Applying V-JEPA 2's framework to Audio-JEPA or Point-JEPA could yield world models for audio-interactive or 3D-physical environments.
  • Scaling laws for world models: The paper's results with ViT-H and ViT-G suggest that scaling the encoder further improves both understanding and planning, motivating investigation of scaling behavior for latent world models.
  • Integration with language: Conditioning the world model on language instructions (rather than low-level actions) could connect V-JEPA 2 to instruction-following and embodied language understanding.

11. Summary

Key Takeaway

V-JEPA 2 demonstrates that a single self-supervised architecture, built on the JEPA principle of latent-space prediction, can simultaneously serve as a state-of-the-art video understanding model, an accurate physical state predictor, and an effective planner for embodied tasks. By adding action-conditioned dynamics prediction on top of V-JEPA's pretrained representations, V-JEPA 2 transforms a passive video encoder into an active world model—one that can internally simulate the consequences of actions and plan before acting.

Main contribution: The two-phase training paradigm (Phase 1: self-supervised video representation learning; Phase 2: action-conditioned latent dynamics) cleanly separates "learning to see" from "learning to predict," while the shared representation space ensures coherence between the two. Planning via CEM over latent rollouts achieves substantially higher success rates than pixel-space world models at a fraction of the computational cost. V-JEPA 2 is the first JEPA-family model to close the perception-to-action loop, realizing the world-model vision of the original JEPA framework.

12. References

  1. Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Ballas, N., LeCun, Y., & Rabbat, M. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv:2506.09985.
  2. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). V-JEPA: Video Joint Embedding Predictive Architecture. arXiv:2404.16930.
  3. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
  4. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Technical Report, Meta AI.
  5. Grill, J.-B., Strub, F., Altché, F., et al. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020.
  6. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021. (DINO)
  7. Oquab, M., Darcet, T., Moutakanni, T., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. TMLR 2024.
  8. Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.
  9. Wang, L., Huang, B., Zhao, Z., et al. (2023). VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. CVPR 2023.
  10. Wang, Y., Li, K., Li, Y., et al. (2024). InternVideo2: Scaling Foundation Models for Multimodal Video Understanding. arXiv:2403.15377.
  11. Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). Learning Latent Dynamics for Planning from Pixels. ICML 2019. (PlaNet/Dreamer lineage)
  12. Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models. ICLR 2021. (DreamerV2)
  13. Rubinstein, R. Y. & Kroese, D. P. (2004). The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer.
  14. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021. (ViT)
  15. Loshchilov, I. & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019. (AdamW)