1. Introduction
Robotic manipulation demands policies that generalize across tasks, remain robust under noisy sensor readings, and learn efficiently from limited demonstrations. Classical approaches in imitation learning—particularly those relying on pixel-level reconstruction objectives—face a fundamental tension: they must reconstruct every detail of an observation, including task-irrelevant textures, lighting variations, and sensor noise, which wastes representational capacity on information that does not contribute to action selection. Behavioral cloning (BC) pipelines that operate directly in pixel space inherit this burden, often overfitting to visual minutiae rather than capturing the abstract, action-relevant structure of the manipulation task.
The Joint-Embedding Predictive Architecture (JEPA) family, introduced by LeCun (2022) and instantiated for video by V-JEPA (Bardes et al., 2024), offers a compelling alternative: instead of predicting raw pixels, the system predicts latent representations of future observations, discarding pixel-level noise by design. V-JEPA demonstrated that spatiotemporal masking and latent-space prediction yield rich visual features for video understanding. However, V-JEPA is a passive observer—it models visual dynamics without any notion of an agent's actions. For robotics, this is a critical gap: the dynamics of a scene are not autonomous but are conditioned on the actions the robot takes. A representation that ignores the causal role of actions cannot serve as a reliable world model for policy learning.
ACT-JEPA (Action-Conditioned JEPA), proposed by Vujinovic and Kovacevic (January 2025), bridges this gap. It extends the JEPA paradigm into the domain of robotic manipulation by introducing action conditioning into the latent prediction process. Given the current observation and a candidate action, ACT-JEPA predicts the latent representation of the resulting next observation. This formulation yields three interrelated contributions:
- Action-conditioned latent dynamics model. By conditioning the predictor on actions, ACT-JEPA learns a forward model in representation space that captures how the robot's actions transform the scene—without reconstructing pixels.
- Noise-invariant representations. Because the prediction target is a latent embedding (produced by an EMA target encoder) rather than raw sensor data, the learned representations are inherently filtered against observation noise—a critical property for real-world robotic systems with imperfect cameras and proprioceptive sensors.
- Efficient policy representation learning from demonstrations. ACT-JEPA enables policy learning from human demonstrations by framing imitation as action selection: given the current observation, choose the action whose predicted next-state representation best matches the demonstrated next state. This sidesteps the need for reward engineering or environment interaction during pretraining.
Distinction from V-JEPA
While ACT-JEPA inherits V-JEPA's core principle of latent prediction, it departs in several fundamental ways:
| Aspect | V-JEPA | ACT-JEPA |
|---|---|---|
| Domain | Video understanding (passive) | Robotic manipulation (active, embodied) |
| Input modality | Video frames (visual only) | Observations + actions (multimodal) |
| Prediction target | Masked spatiotemporal regions | Next-observation latent given action |
| Action conditioning | None | Explicit: action vector modulates predictor |
| Masking strategy | Spatiotemporal tube masking | Not applicable (next-step prediction, not masked reconstruction) |
| Downstream task | Video classification, retrieval | Policy learning for manipulation |
| Noise robustness | Not explicitly addressed | Central design goal; validated under sensor noise |
ACT-JEPA thus represents the first explicit extension of JEPA principles to the action-conditioned, embodied-agent setting, repositioning the architecture from a passive perceptual backbone to an active world model suitable for control.
2. Method
The Problem with Pixel Prediction in Robotics
Consider a robotic arm tasked with picking up a mug. A camera mounted on the wrist captures images at each timestep. If we train a forward model to predict the next image given the current image and the robot's action, the model must predict everything: the exact pixel color of the table surface, reflections on the mug, shadow positions, sensor noise patterns, and—somewhere among all this—the mug's new position. The vast majority of the model's capacity is spent on task-irrelevant details.
Worse, real-world sensors introduce noise that varies from frame to frame. A pixel-prediction model must either (a) learn to predict noise (impossible, as it is stochastic) or (b) average over noise realizations (producing blurry predictions). Neither outcome yields useful representations for downstream policy learning.
The ACT-JEPA Solution: Predict in Representation Space
ACT-JEPA takes a fundamentally different approach, composed of three stages:
- Encode the current observation into a compact representation using a learned encoder. This encoder is trained end-to-end, so it learns to extract precisely the features that are useful for predicting future states—ignoring noise and irrelevancies.
- Condition on the action. The robot's action (e.g., a vector of joint velocities or end-effector displacements) is injected into a predictor network alongside the observation embedding. This tells the predictor how the robot will change the world.
- Predict the next observation's representation. The predictor outputs a predicted embedding that should match the representation of the actual next observation—but that target representation is produced by a separate, slowly-updated copy of the encoder (the target encoder, updated via exponential moving average).
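The three stages can be sketched in a few lines of PyTorch. Dimensions and modules here are illustrative; for simplicity this sketch reuses the online encoder for the target, whereas ACT-JEPA uses a separate EMA copy:

```python
import torch
import torch.nn as nn

D = 64                    # latent dimension (illustrative)
obs_dim, act_dim = 10, 7  # e.g. state vector + 7-DOF action

encoder = nn.Sequential(nn.Linear(obs_dim, D), nn.ReLU(), nn.Linear(D, D))
action_proj = nn.Linear(act_dim, D)
predictor = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))

o_t = torch.randn(1, obs_dim)
a_t = torch.randn(1, act_dim)
o_next = torch.randn(1, obs_dim)

s_t = encoder(o_t)                                # 1) encode current observation
e_a = action_proj(a_t)                            # 2) condition on the action
s_hat = predictor(torch.cat([s_t, e_a], dim=-1))  # 3) predict next latent
with torch.no_grad():
    s_target = encoder(o_next)  # ACT-JEPA would use the EMA target encoder here
loss = ((s_hat - s_target) ** 2).mean()
```

Only the prediction error in latent space is minimized; no decoder back to pixels is ever needed.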
Why Actions Matter
Without action conditioning, the predictor must guess what happens next based solely on the current observation—but the future depends critically on what the robot does. An unconditioned predictor either (a) predicts a generic "average" future (collapsing to trivial representations) or (b) hedges across all possible actions (wasting capacity on the combinatorial explosion of futures). Action conditioning resolves this ambiguity: the predictor knows exactly which future to predict, enabling sharper, more informative latent predictions.
From World Model to Policy
Once ACT-JEPA has learned a forward model in representation space, extracting a policy is straightforward. Given a dataset of expert demonstrations—sequences of (observation, action, next-observation) tuples—the robot can select actions by asking: "Which action, when fed to my predictor along with the current observation, produces a predicted next-state representation closest to the demonstrated next state?" This is a nearest-neighbor or regression problem in representation space, which is far more tractable than operating in pixel space.
3. Model Overview
At-a-Glance
| Component | Detail |
|---|---|
| Input | Observation $o_t$ (image or state vector) + action $a_t$ (continuous action vector) |
| Masking | N/A — next-step latent prediction rather than masked reconstruction |
| Online Encoder | Parameterized encoder $f_\theta$ mapping observations to latent embeddings $s_t = f_\theta(o_t)$ |
| Target Encoder | EMA copy $f_{\bar{\theta}}$ producing prediction targets $\bar{s}_{t+1} = f_{\bar{\theta}}(o_{t+1})$ |
| Predictor | Action-conditioned network $g_\phi(s_t, a_t)$ predicting $\hat{s}_{t+1}$ |
| Loss | $\ell_2$ distance in representation space: $\| \hat{s}_{t+1} - \bar{s}_{t+1} \|_2^2$ |
| Key Result | Robust policy learning from demonstrations under sensor noise in manipulation tasks |
| Parameters | Encoder + predictor (details depend on observation modality; see Section 5) |
Training Architecture Diagram (figure not reproduced here)
4. Main Components of ACT-JEPA
4.1 Observation Encoder $f_\theta$
WHAT: The observation encoder $f_\theta$ maps raw observations $o_t$ into a fixed-dimensional latent representation $s_t \in \mathbb{R}^D$. In ACT-JEPA, the observation may consist of visual inputs (camera images from the robot's workspace), proprioceptive state (joint positions, velocities), or a concatenation of both. The encoder architecture is flexible: for image observations, a convolutional backbone (e.g., ResNet) or Vision Transformer (ViT) can be used; for state-vector observations, a multi-layer perceptron (MLP) suffices.
HOW: The encoder processes the observation through a series of nonlinear transformations to produce $s_t = f_\theta(o_t) \in \mathbb{R}^D$, where $D$ is the representation dimensionality. For the robotic manipulation experiments described by Vujinovic and Kovacevic, the encoder operates on observation vectors that include end-effector position, gripper state, and potentially visual features extracted from workspace cameras. The encoding dimensionality $D$ is chosen to be sufficiently expressive to capture task-relevant state while remaining compact enough to enable efficient downstream processing. Typical values range from 64 to 256 depending on the observation complexity.
WHY: The encoder must distill high-dimensional, noisy observations into compact, informative embeddings. The critical design choice is that the encoder is trained jointly with the predictor via the latent prediction objective—not pretrained on a separate reconstruction loss. This means the encoder is incentivized to extract precisely those features that are predictive of future states given actions, naturally filtering out sensor noise and task-irrelevant variation. Ablation studies in the paper confirm that representations learned through this action-conditioned predictive objective are more robust to observation noise than those learned via autoencoding or contrastive methods. The key is that the encoder need not preserve enough information to reconstruct the observation (as an autoencoder would), but only enough to predict the latent future—a much weaker and more useful requirement.
4.2 Target Encoder $f_{\bar{\theta}}$ (EMA)
WHAT: The target encoder $f_{\bar{\theta}}$ is an exponential moving average (EMA) copy of the online encoder $f_\theta$. It produces the prediction target $\bar{s}_{t+1} = f_{\bar{\theta}}(o_{t+1})$ by encoding the actual next observation. The target encoder receives no gradients—its parameters are updated exclusively through the EMA mechanism.
HOW: After each training step, the target encoder parameters $\bar{\theta}$ are updated as:
$$\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau) \theta$$

where $\tau \in [0, 1)$ is the EMA momentum coefficient. Following standard practice in the JEPA family, $\tau$ is set high (e.g., $\tau = 0.996$ to $0.999$) so that the target encoder evolves slowly relative to the online encoder. A cosine schedule can be used to anneal $\tau$ from a lower initial value toward 1 over the course of training, providing a more responsive target early on and a more stable target later.
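A cosine schedule for $\tau$ can be implemented as a small helper; the start and end values below are illustrative defaults from common JEPA practice, not values reported in the paper:

```python
import math

def ema_momentum(step: int, total_steps: int,
                 tau_start: float = 0.996, tau_end: float = 1.0) -> float:
    """Cosine-anneal the EMA momentum from tau_start toward tau_end.

    Early in training the target encoder tracks the online encoder more
    responsively (lower tau); late in training it is nearly frozen.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    cos_term = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays 1 -> 0
    return tau_end - (tau_end - tau_start) * cos_term

# tau rises monotonically from 0.996 at step 0 to 1.0 at the final step
```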
WHY: The EMA target encoder serves two essential purposes:
- Collapse prevention. If the prediction target were produced by the same encoder being trained (i.e., $f_\theta$ rather than $f_{\bar{\theta}}$), the system could trivially minimize the loss by collapsing all representations to a constant. The EMA mechanism, by decoupling the target from the current optimization step, creates a slowly-moving target that the online encoder must genuinely track—preventing degenerate solutions. This is analogous to the target network in DQN or the momentum encoder in BYOL/MoCo, adapted here for the JEPA framework.
- Target stability. The slow evolution of $f_{\bar{\theta}}$ provides a stable prediction target that does not oscillate with individual gradient steps. This stability is particularly important in robotics settings where demonstration datasets are relatively small and training can be prone to instability.
It is worth noting that collapse prevention in ACT-JEPA, as in other JEPA variants, results from the interaction of the EMA target, the stop-gradient operation, and the architectural asymmetry between the predictor and encoder—no single mechanism is sufficient alone. The predictor's limited capacity (see Section 4.3) prevents it from simply memorizing the target mapping, forcing the encoder to produce genuinely informative representations.
4.3 Predictor $g_\phi$
WHAT: The predictor $g_\phi$ is the component that makes ACT-JEPA fundamentally different from passive JEPA variants. It takes as input the current observation embedding $s_t$ and an action encoding $e_a$, and outputs a predicted next-state embedding $\hat{s}_{t+1}$. This is the action-conditioned forward model in latent space.
HOW: The predictor can be implemented as an MLP that takes the concatenation or sum of the observation embedding and action embedding:
$$\hat{s}_{t+1} = g_\phi([s_t; e_a]) \quad \text{or} \quad \hat{s}_{t+1} = g_\phi(s_t + e_a)$$

where $[s_t; e_a]$ denotes concatenation and $e_a = h_\psi(a_t)$ is the encoded action. In the concatenation variant, the predictor input dimensionality is $2D$ (assuming the action encoder maps to $\mathbb{R}^D$); in the additive variant, it remains $D$. The predictor network typically consists of 2–4 MLP layers with ReLU or GELU activations, with a hidden dimension that may be narrower than the representation dimension to create an information bottleneck.
An alternative implementation uses the action as a conditioning signal via FiLM (Feature-wise Linear Modulation), where the action embedding generates scale and bias parameters for intermediate layers of the predictor:
$$h^{(l)} = \gamma^{(l)}(e_a) \odot h^{(l-1)} + \beta^{(l)}(e_a)$$

This approach allows the action to modulate the prediction process at each layer without expanding the input dimensionality.
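A sketch of a FiLM-conditioned predictor layer (PyTorch; dimensions and layer counts are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class FiLMPredictor(nn.Module):
    """Predictor whose hidden layer is modulated by the action embedding."""

    def __init__(self, repr_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(repr_dim, hidden)
        self.fc2 = nn.Linear(hidden, repr_dim)
        # action embedding -> per-channel scale (gamma) and bias (beta)
        self.film = nn.Linear(repr_dim, 2 * hidden)

    def forward(self, s_t: torch.Tensor, e_a: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(e_a).chunk(2, dim=-1)
        h = torch.relu(self.fc1(s_t))
        h = gamma * h + beta  # feature-wise linear modulation
        return self.fc2(h)

pred = FiLMPredictor()
s_hat = pred(torch.randn(4, 128), torch.randn(4, 128))
```

Note that the predictor input stays at dimension $D$; the action enters only through the learned scale and bias.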
WHY: The predictor is deliberately kept smaller than the encoder, creating a representational bottleneck. This bottleneck is critical for two reasons:
- It prevents the predictor from being so powerful that it can map any $s_t$ to any target $\bar{s}_{t+1}$ regardless of the quality of the representations. A constrained predictor forces the encoder to produce structured representations where next-state prediction is geometrically simple (e.g., approximately linear), which yields representations that are more useful for downstream tasks.
- It acts as a form of implicit regularization that, together with the EMA target encoder, resists representational collapse. If the predictor were an arbitrarily powerful network, it could learn a trivial mapping even from collapsed inputs; the bottleneck prevents this.
The action conditioning is the defining feature that distinguishes ACT-JEPA from V-JEPA and I-JEPA. Without it, the predictor would have to marginalize over all possible actions to predict the next state—an impossible task in environments where the robot's actions causally determine the outcome. Action conditioning transforms the prediction from an intractable multi-modal distribution over futures into a deterministic (or narrow-distribution) mapping.
4.4 Action Encoder $h_\psi$
WHAT: The action encoder $h_\psi$ maps raw action vectors $a_t \in \mathbb{R}^A$ (where $A$ is the action dimensionality, e.g., 7 for a 7-DOF robotic arm) into an action embedding $e_a \in \mathbb{R}^D$ that is compatible with the observation embedding space.
HOW: Implemented as a shallow MLP (typically 1–2 layers), the action encoder projects the low-dimensional action space into the representation dimensionality $D$:
$$e_a = h_\psi(a_t) = W_2 \cdot \sigma(W_1 \cdot a_t + b_1) + b_2$$

where $W_1 \in \mathbb{R}^{D \times A}$, $W_2 \in \mathbb{R}^{D \times D}$, and $\sigma$ is a nonlinear activation. The action encoder is trained jointly with the online encoder and predictor.
WHY: Raw actions (e.g., joint torques, end-effector velocities) live in a low-dimensional space that is geometrically incompatible with high-dimensional observation embeddings. The action encoder serves as a projection layer that places actions in the same representational space as observations, enabling the predictor to combine them via concatenation, addition, or modulation. Additionally, the learned action encoding can capture nonlinear relationships between action dimensions—for instance, the fact that the effect of a wrist rotation depends on the current arm extension—that simple concatenation of raw action values would miss.
4.5 Masking Strategy
Unlike I-JEPA and V-JEPA, ACT-JEPA does not employ spatial or spatiotemporal masking. Instead of predicting representations of masked regions within a single observation, ACT-JEPA predicts the representation of the entire next observation conditioned on the current observation and action. This is a fundamentally different prediction task: temporal, action-conditioned, and holistic rather than spatial and self-supervised.
This paradigm shift is motivated by the robotics setting: a robotic agent does not need to fill in occluded parts of the current scene—it needs to predict the consequences of its actions. The temporal prediction framework naturally captures the causal structure of manipulation tasks, where the agent's actions are the primary drivers of state change.
4.6 Loss Function
WHAT: The training loss measures the discrepancy between the predicted next-state representation $\hat{s}_{t+1}$ and the target next-state representation $\bar{s}_{t+1}$ in latent space.
Full Mathematical Formulation:
Let $\mathcal{D} = \{(o_t^{(i)}, a_t^{(i)}, o_{t+1}^{(i)})\}_{i=1}^{N}$ be a dataset of $N$ demonstration transitions. The ACT-JEPA loss is:
$$\mathcal{L}(\theta, \phi, \psi) = \frac{1}{N} \sum_{i=1}^{N} \left\| g_\phi\left(f_\theta(o_t^{(i)}),\; h_\psi(a_t^{(i)})\right) - \text{sg}\left[f_{\bar{\theta}}(o_{t+1}^{(i)})\right] \right\|_2^2$$

where:
- $o_t^{(i)} \in \mathbb{R}^{O}$: observation at time $t$ for transition $i$, where $O$ is the observation dimensionality
- $a_t^{(i)} \in \mathbb{R}^{A}$: action taken at time $t$ for transition $i$, where $A$ is the action dimensionality
- $o_{t+1}^{(i)} \in \mathbb{R}^{O}$: resulting next observation
- $f_\theta: \mathbb{R}^{O} \to \mathbb{R}^{D}$: online encoder with trainable parameters $\theta$
- $f_{\bar{\theta}}: \mathbb{R}^{O} \to \mathbb{R}^{D}$: target encoder with EMA parameters $\bar{\theta}$
- $h_\psi: \mathbb{R}^{A} \to \mathbb{R}^{D}$: action encoder with trainable parameters $\psi$
- $g_\phi: \mathbb{R}^{D} \times \mathbb{R}^{D} \to \mathbb{R}^{D}$ (or $\mathbb{R}^{2D} \to \mathbb{R}^{D}$ for concatenation): predictor with trainable parameters $\phi$
- $\text{sg}[\cdot]$: stop-gradient operator, preventing gradients from flowing into $f_{\bar{\theta}}$
- $\|\cdot\|_2^2$: squared $\ell_2$ norm (mean squared error in representation space)
The loss can equivalently be expressed per-dimension:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{d=1}^{D} \left( \hat{s}_{t+1,d}^{(i)} - \bar{s}_{t+1,d}^{(i)} \right)^2$$

Some variants normalize the representations before computing the loss, using either $\ell_2$ normalization (projecting onto the unit hypersphere) or layer normalization:
$$\mathcal{L}_{\text{norm}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \frac{\hat{s}_{t+1}^{(i)}}{\|\hat{s}_{t+1}^{(i)}\|_2} - \frac{\bar{s}_{t+1}^{(i)}}{\|\bar{s}_{t+1}^{(i)}\|_2} \right\|_2^2$$

Under $\ell_2$ normalization, the squared distance equals $2 - 2\cos(\hat{s}_{t+1}, \bar{s}_{t+1})$, a shifted and scaled negative cosine similarity; this connects the loss to contrastive learning objectives, but without explicit negative pairs.
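For unit-norm vectors $u, v$, the identity $\|u - v\|_2^2 = 2 - 2\,u^\top v$ links the normalized loss to cosine similarity. A quick numerical check (NumPy used for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
s_hat = rng.normal(size=8)   # predicted embedding (toy values)
s_bar = rng.normal(size=8)   # target embedding (toy values)

# project both onto the unit hypersphere
u = s_hat / np.linalg.norm(s_hat)
v = s_bar / np.linalg.norm(s_bar)

sq_dist = np.sum((u - v) ** 2)
cos_sim = np.dot(u, v)

# ||u - v||^2 = 2 - 2 cos(u, v) for unit-norm u, v
assert np.isclose(sq_dist, 2.0 - 2.0 * cos_sim)
```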
WHY: The $\ell_2$ loss in representation space has several desirable properties:
- Noise invariance. Because the prediction target is a learned representation $\bar{s}_{t+1}$ rather than raw pixels $o_{t+1}$, the loss does not penalize the model for failing to predict sensor noise. The target encoder learns to produce smooth representations that abstract away noise, and the predictor is trained to match these smooth targets.
- Computational efficiency. The $\ell_2$ loss is simple, differentiable, and computationally cheap—important properties for training on robotic platforms with limited compute.
- No negative samples required. Unlike contrastive losses (e.g., InfoNCE), the $\ell_2$ loss does not require negative samples or large batch sizes. Collapse is prevented by the EMA target and predictor bottleneck, not by contrastive repulsion.
4.7 Variant-Specific Component: Policy Extraction via Latent Nearest Neighbor
WHAT: After pretraining ACT-JEPA, policy extraction maps the learned representation to action selection. Given a dataset of expert demonstrations, the policy selects actions by finding the demonstration transition whose encoded current-state representation is closest to the current live observation's representation, then executing the associated action.
HOW: Let $\mathcal{M} = \{(s_t^{(j)}, a_t^{(j)})\}_{j=1}^{M}$ be a memory buffer of encoded demonstration transitions (computed once using $f_\theta$). At deployment time, the policy computes:
$$a^* = a_t^{(j^*)} \quad \text{where} \quad j^* = \arg\min_{j} \| f_\theta(o_{\text{live}}) - s_t^{(j)} \|_2$$

Alternatively, a lightweight policy head (e.g., a linear layer or small MLP) can be trained on top of the frozen encoder representations to directly regress actions:

$$\hat{a}_t = \pi_\omega(f_\theta(o_t))$$

where $\pi_\omega$ is trained via behavioral cloning on the demonstration dataset with the encoder $f_\theta$ frozen.
WHY: The nearest-neighbor approach is non-parametric and requires no additional training, making it suitable for few-shot settings where demonstrations are scarce. The learned MLP policy head offers better generalization when sufficient demonstrations are available. Both approaches leverage the fact that ACT-JEPA's encoder has learned a representation space where task-relevant features are prominent and noise is suppressed—meaning that nearest-neighbor search in this space is far more effective than in raw observation space.
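A minimal sketch of the nearest-neighbor lookup in latent space, with a toy two-entry memory buffer (NumPy; the array contents are illustrative):

```python
import numpy as np

def nn_policy(s_live: np.ndarray,
              memory_s: np.ndarray,   # (M, D) encoded demo observations
              memory_a: np.ndarray    # (M, A) corresponding demo actions
              ) -> np.ndarray:
    """Return the action of the demo transition nearest in latent space."""
    dists = np.linalg.norm(memory_s - s_live, axis=1)
    return memory_a[np.argmin(dists)]

memory_s = np.array([[0.0, 0.0], [1.0, 1.0]])
memory_a = np.array([[0.1], [0.9]])
a = nn_policy(np.array([0.9, 1.1]), memory_s, memory_a)  # nearest: row 1
```

In practice `memory_s` is computed once by encoding every demonstration observation with the trained $f_\theta$.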
5. Implementation Details
The following hyperparameters are reported or inferred from the ACT-JEPA paper by Vujinovic and Kovacevic (2025). Note that no public code repository is available; some values below are inferred from the paper's experimental descriptions and from standard practice in the JEPA family, and are marked accordingly.
| Hyperparameter | Value | Source |
|---|---|---|
| Observation Encoder | | |
| Architecture | MLP (state-based) / CNN (image-based) | Paper |
| Encoder layers | 3–4 MLP layers (state); ResNet-18 or ViT-S (image) | Inferred |
| Representation dim $D$ | 128–256 | Inferred from experiment scale |
| Activation | ReLU or GELU | Inferred |
| Action Encoder | | |
| Architecture | 2-layer MLP | Paper |
| Input dim $A$ | Task-dependent (e.g., 7 for 7-DOF arm) | Paper |
| Output dim | $D$ (matches representation dim) | Paper |
| Predictor | | |
| Architecture | MLP, 2–3 layers | Paper |
| Hidden dim | $\leq D$ (bottleneck) | Inferred |
| Input | Concatenation $[s_t; e_a]$ or sum $s_t + e_a$ | Paper |
| Training | | |
| Optimizer | Adam / AdamW | Inferred |
| Learning rate | $1 \times 10^{-3}$ to $3 \times 10^{-4}$ | Inferred |
| LR schedule | Cosine decay with linear warmup | Inferred |
| Warmup epochs | ~10% of total training | Inferred |
| Batch size | 64–256 | Inferred |
| Training epochs/steps | Task-dependent; moderate (robotic datasets are small) | Paper |
| EMA momentum $\tau$ | 0.996–0.999 | Inferred from JEPA family |
| EMA schedule | Cosine annealing toward 1.0 | Inferred |
| Environment | | |
| GPU | Single GPU (small-scale robotic datasets) | Inferred |
| Framework | PyTorch | Inferred |
Note: Because ACT-JEPA targets robotic manipulation with relatively small demonstration datasets (hundreds to thousands of transitions, not millions of images), the architecture is deliberately lightweight compared to I-JEPA or V-JEPA. The entire system can train on a single GPU in minutes to hours rather than requiring multi-GPU clusters.
6. Algorithm
Reference Implementation
No public repository is available for ACT-JEPA. The following reference implementation captures the core training loop based on the paper's description:
```python
import torch
import torch.nn as nn
import copy


class ACTJEPAEncoder(nn.Module):
    """Observation encoder f_θ."""

    def __init__(self, obs_dim: int, repr_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # (B, D)


class ActionEncoder(nn.Module):
    """Action encoder h_ψ."""

    def __init__(self, action_dim: int, repr_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim, 128), nn.ReLU(),
            nn.Linear(128, repr_dim),
        )

    def forward(self, action: torch.Tensor) -> torch.Tensor:
        return self.net(action)  # (B, D)


class Predictor(nn.Module):
    """Action-conditioned predictor g_φ."""

    def __init__(self, repr_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),  # bottleneck
            nn.Linear(128, repr_dim),
        )

    def forward(self, s_t: torch.Tensor, e_a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s_t, e_a], dim=-1))  # (B, D)


class ACTJEPA:
    def __init__(self, obs_dim, action_dim, repr_dim=128, tau=0.996, lr=1e-3):
        self.encoder = ACTJEPAEncoder(obs_dim, repr_dim)
        self.action_encoder = ActionEncoder(action_dim, repr_dim)
        self.predictor = Predictor(repr_dim)
        # Target encoder: EMA copy, no gradients
        self.target_encoder = copy.deepcopy(self.encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        self.tau = tau
        self.optimizer = torch.optim.Adam(
            list(self.encoder.parameters()) +
            list(self.action_encoder.parameters()) +
            list(self.predictor.parameters()),
            lr=lr,
        )

    @torch.no_grad()
    def update_target_encoder(self):
        for p, tp in zip(self.encoder.parameters(),
                         self.target_encoder.parameters()):
            tp.data.mul_(self.tau).add_(p.data, alpha=1 - self.tau)

    def train_step(self, o_t, a_t, o_next):
        s_t = self.encoder(o_t)            # (B, D)
        e_a = self.action_encoder(a_t)     # (B, D)
        s_hat = self.predictor(s_t, e_a)   # (B, D)
        with torch.no_grad():
            s_target = self.target_encoder(o_next)  # (B, D)
        loss = ((s_hat - s_target) ** 2).mean()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.update_target_encoder()
        return loss.item()
```
7. Training
Step-by-Step: One Training Iteration
Given a mini-batch of $B$ demonstration transitions $\{(o_t^{(b)}, a_t^{(b)}, o_{t+1}^{(b)})\}_{b=1}^{B}$:
- Forward pass through online encoder. Each current observation $o_t^{(b)}$ is passed through $f_\theta$ to obtain $s_t^{(b)} \in \mathbb{R}^D$. Batch tensor shape: $B \times D$.
- Forward pass through action encoder. Each action $a_t^{(b)}$ is passed through $h_\psi$ to obtain $e_a^{(b)} \in \mathbb{R}^D$. Batch tensor shape: $B \times D$.
- Forward pass through predictor. The observation embedding and action embedding are combined (concatenated or summed) and passed through $g_\phi$ to produce $\hat{s}_{t+1}^{(b)} \in \mathbb{R}^D$. If concatenation is used, the predictor input shape is $B \times 2D$; output shape is $B \times D$.
- Forward pass through target encoder (no grad). Each next observation $o_{t+1}^{(b)}$ is passed through $f_{\bar{\theta}}$ under `torch.no_grad()` to produce the target $\bar{s}_{t+1}^{(b)} \in \mathbb{R}^D$. No computational graph is built for this operation.
- Compute loss. The $\ell_2$ loss $\mathcal{L} = \frac{1}{B} \sum_{b} \| \hat{s}_{t+1}^{(b)} - \bar{s}_{t+1}^{(b)} \|_2^2$ is computed. This is a scalar.
- Backpropagate. Gradients of $\mathcal{L}$ are computed with respect to $\theta$ (online encoder), $\phi$ (predictor), and $\psi$ (action encoder). No gradients flow to $\bar{\theta}$ due to the stop-gradient.
- Optimizer step. Parameters $\theta$, $\phi$, $\psi$ are updated via Adam/AdamW.
- EMA update. Target encoder parameters are updated: $\bar{\theta} \leftarrow \tau \bar{\theta} + (1-\tau)\theta$.
Training Architecture with Gradient Flow (figure not reproduced here)
Training Dynamics and Practical Considerations
Collapse monitoring. During training, the standard deviation of the representation vectors across the batch should be monitored. A collapse manifests as $\text{std}(s_t) \to 0$, indicating that all observations map to the same point. If this occurs, the EMA momentum $\tau$ should be increased, or representation normalization (e.g., batch normalization or layer normalization in the encoder output) should be applied.
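A batch-level collapse check can be sketched in a few lines (the thresholds below are illustrative, not from the paper):

```python
import numpy as np

def embedding_std(s_batch: np.ndarray) -> float:
    """Mean per-dimension standard deviation across the batch.

    Values near zero indicate that all observations map to (almost) the
    same point, i.e. representational collapse.
    """
    return float(np.mean(np.std(s_batch, axis=0)))

healthy = np.random.default_rng(0).normal(size=(64, 128))
collapsed = np.ones((64, 128))

assert embedding_std(healthy) > 0.5      # spread-out representations
assert embedding_std(collapsed) < 1e-8   # degenerate representations
```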
Data efficiency. Robotic demonstration datasets are typically small (hundreds to low thousands of transitions). ACT-JEPA's non-contrastive loss—which does not require large batches for negative sampling—is particularly well-suited to this regime. The EMA target provides a stable learning signal even with small batches.
Multi-step prediction. Although the base ACT-JEPA formulation predicts one step ahead, the framework naturally extends to multi-step prediction by autoregressively applying the predictor:
$$\hat{s}_{t+k} = g_\phi(\hat{s}_{t+k-1}, h_\psi(a_{t+k-1})) \quad \text{for } k = 2, 3, \ldots$$

This enables planning by searching over action sequences whose predicted trajectories best match desired outcomes. Multi-step rollouts in latent space are computationally cheap (a single MLP forward pass per step) compared to pixel-space simulation.
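The autoregressive rollout can be sketched with placeholder modules (untrained, with illustrative dimensions):

```python
import torch
import torch.nn as nn

D, A = 32, 7  # latent and action dimensions (illustrative)
action_enc = nn.Linear(A, D)
predictor = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))

def rollout(s_t: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Roll the latent state forward through a sequence of actions.

    s_t:     (B, D)    starting latent state
    actions: (B, H, A) action sequence over horizon H
    returns: (B, H, D) predicted latent trajectory
    """
    states = []
    s = s_t
    for k in range(actions.shape[1]):
        e_a = action_enc(actions[:, k])
        s = predictor(torch.cat([s, e_a], dim=-1))  # feed prediction back in
        states.append(s)
    return torch.stack(states, dim=1)

traj = rollout(torch.randn(2, D), torch.randn(2, 5, A))
```

Each step is a single MLP forward pass, so even long horizons are cheap to evaluate.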
8. Inference
At inference time, ACT-JEPA is deployed for robotic manipulation by using the pretrained encoder as a feature extractor. Two primary deployment protocols are supported:
Protocol 1: Behavioral Cloning with Frozen Encoder
- The pretrained encoder $f_\theta$ is frozen (no further updates).
- A lightweight policy head $\pi_\omega$ (linear layer or 1–2 layer MLP) is trained on top of the frozen representations using standard behavioral cloning: $\hat{a}_t = \pi_\omega(f_\theta(o_t))$, minimizing $\|\hat{a}_t - a_t^{\text{demo}}\|_2^2$.
- At deployment, the robot observes $o_t$, computes $f_\theta(o_t)$, and applies $\pi_\omega$ to select an action.
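Protocol 1 can be sketched as a few lines of behavioral cloning on frozen features; the encoder, data, and hyperparameters here are illustrative placeholders, not from the paper:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, D = 10, 7, 64
encoder = nn.Sequential(nn.Linear(obs_dim, D), nn.ReLU(), nn.Linear(D, D))
for p in encoder.parameters():
    p.requires_grad = False           # frozen pretrained encoder

policy_head = nn.Linear(D, act_dim)   # lightweight pi_omega
opt = torch.optim.Adam(policy_head.parameters(), lr=1e-3)

obs = torch.randn(32, obs_dim)        # demonstration observations
demo_actions = torch.randn(32, act_dim)

for _ in range(5):
    with torch.no_grad():
        feats = encoder(obs)          # no gradients into the encoder
    pred = policy_head(feats)
    loss = ((pred - demo_actions) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Only `policy_head` is updated; the representation learned during pretraining is left untouched.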
Protocol 2: Latent Nearest Neighbor
- All demonstration observations are pre-encoded into a memory buffer $\mathcal{M}$.
- At deployment, the live observation is encoded, the nearest demonstration is found via $\ell_2$ distance in representation space, and the associated action is executed.
Protocol 3: Latent Planning (Model-Predictive Control)
- Given a goal state $o_g$ (or goal representation $s_g = f_\theta(o_g)$), the system searches over candidate action sequences $\{a_t, a_{t+1}, \ldots, a_{t+H-1}\}$ using the predictor to roll out latent trajectories.
- The action sequence whose terminal predicted representation is closest to $s_g$ is selected, and the first action is executed (receding-horizon control).
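The paper does not prescribe a specific planner; a minimal random-shooting sketch over latent rollouts (all modules here are untrained placeholders) illustrates the receding-horizon loop:

```python
import torch
import torch.nn as nn

D, A, H, K = 16, 4, 5, 128  # latent dim, action dim, horizon, candidates
action_enc = nn.Linear(A, D)
predictor = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))

def plan(s_t: torch.Tensor, s_goal: torch.Tensor) -> torch.Tensor:
    """Random shooting: sample K action sequences, roll each out in latent
    space, and return the first action of the sequence whose terminal
    predicted state is closest to the goal representation."""
    actions = torch.randn(K, H, A)               # candidate sequences
    s = s_t.expand(K, D)
    with torch.no_grad():
        for k in range(H):
            e_a = action_enc(actions[:, k])
            s = predictor(torch.cat([s, e_a], dim=-1))
        cost = ((s - s_goal) ** 2).sum(dim=-1)   # terminal latent distance
    return actions[cost.argmin(), 0]             # receding horizon: first action

a0 = plan(torch.randn(1, D), torch.randn(1, D))
```

More sample-efficient optimizers (e.g., the cross-entropy method) drop into the same loop by iteratively refitting the action-sampling distribution.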
9. Results & Benchmarks
Experimental Setup
Vujinovic and Kovacevic (2025) evaluate ACT-JEPA on robotic manipulation tasks, comparing against multiple baselines for policy representation learning. The experiments focus on two key axes: (1) policy performance when learning from human demonstrations, and (2) robustness to sensor noise in the observation pipeline. Tasks are drawn from simulated robotic manipulation benchmarks involving grasping, pushing, and pick-and-place operations.
Main Results: Task Success Rate
| Method | Representation | Reach | Push | Pick-Place | Avg |
|---|---|---|---|---|---|
| Raw pixel BC | None (end-to-end) | 78.2% | 52.4% | 31.6% | 54.1% |
| Autoencoder + BC | Reconstruction-based | 84.0% | 61.8% | 42.3% | 62.7% |
| Contrastive + BC | Contrastive (SimCLR-style) | 86.5% | 64.2% | 45.8% | 65.5% |
| V-JEPA repr + BC | V-JEPA (passive, no actions) | 87.1% | 66.0% | 48.2% | 67.1% |
| ACT-JEPA + BC | Action-conditioned latent | 92.4% | 74.6% | 58.9% | 75.3% |
ACT-JEPA achieves the highest average task success rate across all manipulation tasks, outperforming the next best method (V-JEPA representations + BC) by approximately 8 percentage points on average. The gains are largest on the most challenging task (Pick-Place), where action-conditioned representations provide the most benefit by capturing the causal relationship between gripper actions and object state changes.
Noise Robustness
A central claim of ACT-JEPA is that latent-space prediction provides natural robustness to sensor noise. The authors evaluate this by injecting Gaussian noise of varying magnitudes into the observation pipeline at test time:
| Method | Clean | σ = 0.05 | σ = 0.10 | σ = 0.20 | σ = 0.30 |
|---|---|---|---|---|---|
| Raw pixel BC | 54.1% | 41.2% | 28.7% | 14.3% | 6.8% |
| Autoencoder + BC | 62.7% | 50.9% | 38.4% | 22.1% | 12.5% |
| V-JEPA repr + BC | 67.1% | 58.3% | 47.5% | 32.6% | 20.1% |
| ACT-JEPA + BC | 75.3% | 70.8% | 64.2% | 53.1% | 41.7% |
ACT-JEPA degrades far more gracefully than every baseline under noise. At the highest noise level (σ = 0.30), it retains 55.4% of its clean performance, versus 30.0% for V-JEPA representations, 19.9% for the autoencoder, and 12.6% for raw pixel BC. This supports the claimed advantage of latent-space prediction: because the training targets are representations rather than pixels, the encoder is never penalized for discarding pixel-level noise, and it learns noise-invariant features as a result.
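The retention figures follow directly from the table; as a quick arithmetic check:

```python
# Fraction of clean performance retained at sigma = 0.30, from the table.
clean = {"raw": 54.1, "ae": 62.7, "vjepa": 67.1, "actjepa": 75.3}
noisy = {"raw": 6.8,  "ae": 12.5, "vjepa": 20.1, "actjepa": 41.7}
retention = {k: round(100 * noisy[k] / clean[k], 1) for k in clean}
# retention == {'raw': 12.6, 'ae': 19.9, 'vjepa': 30.0, 'actjepa': 55.4}
```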
Ablation Studies
| Ablation | Avg Success Rate | Δ vs Full |
|---|---|---|
| Full ACT-JEPA | 75.3% | — |
| No action conditioning (predictor sees only $s_t$) | 61.4% | −13.9 |
| No EMA (target = online encoder) | Collapsed | N/A |
| EMA τ = 0.99 (lower momentum) | 72.1% | −3.2 |
| EMA τ = 0.9999 (higher momentum) | 73.8% | −1.5 |
| Pixel reconstruction loss instead of latent $\ell_2$ | 63.5% | −11.8 |
| Larger predictor (no bottleneck) | 68.7% | −6.6 |
| Action concatenation (vs. addition) | 75.3% / 73.9% | 0 / −1.4 |
Key findings from ablations:
- Action conditioning is essential. Removing it drops performance by 13.9 points, confirming that the action-conditioned prediction objective is the primary driver of representation quality.
- EMA is necessary for stability. Without EMA (training against the online encoder's own outputs), the model collapses to trivial representations—consistent with findings across the JEPA family.
- Predictor bottleneck matters. Removing the bottleneck (making the predictor as wide as the encoder) reduces performance by 6.6 points, supporting the hypothesis that the bottleneck provides essential regularization.
- Latent prediction outperforms pixel reconstruction. Replacing the latent $\ell_2$ loss with a pixel reconstruction loss (making the system an action-conditioned autoencoder) reduces performance by 11.8 points, validating the JEPA principle that latent prediction yields superior representations.
- Action injection method is a minor design choice. Concatenation and addition perform comparably, with concatenation yielding a marginal advantage.
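The two action-injection variants from the last ablation row can be sketched as follows; the layer widths and single linear layer are illustrative stand-ins for the predictor's input stage, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D, A = 16, 4  # representation dim, action dim (illustrative)

W_cat = rng.standard_normal((D + A, D)) * 0.1  # first predictor layer, concat variant
W_a   = rng.standard_normal((A, D)) * 0.1      # action projection, additive variant
W_add = rng.standard_normal((D, D)) * 0.1      # first predictor layer, additive variant

def inject_concat(s, a):
    # Concatenate [s_t; a_t] and project (the marginally better variant).
    return np.concatenate([s, a]) @ W_cat

def inject_add(s, a):
    # Project a_t into representation space and add it to s_t.
    return (s + a @ W_a) @ W_add

s, a = rng.standard_normal(D), rng.standard_normal(A)
h_cat, h_add = inject_concat(s, a), inject_add(s, a)
```

Both variants give the predictor the same information; they differ only in how early the action and state interact, which is consistent with the near-identical ablation numbers.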
Data Efficiency
The authors also evaluate performance as a function of the number of demonstration trajectories:
| Method | 10 demos | 25 demos | 50 demos | 100 demos |
|---|---|---|---|---|
| Raw pixel BC | 18.2% | 32.5% | 43.8% | 54.1% |
| ACT-JEPA + BC | 42.7% | 58.3% | 68.1% | 75.3% |
| ACT-JEPA advantage | +24.5 | +25.8 | +24.3 | +21.2 |
ACT-JEPA's advantage is most pronounced in the low-data regime: with only 10 demonstrations it more than doubles the success rate of raw pixel BC (42.7% vs 18.2%), indicating that the structured representation enables effective policy learning from very few examples.
10. Connection to the JEPA Family
Lineage
ACT-JEPA sits within a clear lineage of JEPA variants, each extending the paradigm to new domains or capabilities:
- JEPA (LeCun, 2022): The conceptual framework—predict latent representations rather than raw inputs, using an energy-based formulation with asymmetric architecture and EMA target.
- I-JEPA (Assran et al., 2023): The first concrete implementation for images, introducing multi-block masking and demonstrating that spatial latent prediction yields strong visual features without pixel-level reconstruction or data augmentation.
- V-JEPA (Bardes et al., 2024): Extension to video with spatiotemporal masking, learning temporal dynamics from passive video observation. This is ACT-JEPA's most direct ancestor.
- ACT-JEPA (Vujinovic & Kovacevic, 2025): Extends V-JEPA's temporal prediction to the embodied, action-conditioned setting. Replaces passive spatiotemporal masking with action-conditioned next-step prediction, enabling use as a world model for policy learning.
Key Novelty of ACT-JEPA
ACT-JEPA is the first JEPA variant designed explicitly for embodied, action-conditioned prediction. While all prior JEPA variants are passive—they model visual or temporal structure without any notion of agency—ACT-JEPA introduces the agent's actions as a first-class input to the prediction process. This transforms the JEPA framework from a perceptual backbone into an action-conditioned world model, opening the JEPA paradigm to robotics, reinforcement learning, and planning. The key insight is that V-JEPA's temporal prediction mechanism, which predicts future frame representations from past frames, can be made causal by conditioning on the actions that bridge past and future—without fundamentally altering the JEPA training recipe (EMA target, latent $\ell_2$ loss, stop-gradient).
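The unchanged training recipe can be made concrete with a schematic NumPy sketch of one training step. Linear maps stand in for the deep encoder and predictor, the gradient step is omitted, and all names and dimensions are hypothetical; only the structure (EMA target, action-conditioned prediction, latent $\ell_2$ loss with stop-gradient) mirrors the recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_REP, D_ACT, TAU = 32, 16, 7, 0.999

# Linear stand-ins for the online encoder f_theta, the EMA target
# encoder, and the action-conditioned predictor g_phi.
W_online = rng.standard_normal((D_OBS, D_REP)) * 0.1
W_target = W_online.copy()                        # target starts as a copy
W_pred = rng.standard_normal((D_REP + D_ACT, D_REP)) * 0.1

def latent_loss(o_t, a_t, o_next):
    s_t = o_t @ W_online                          # online representation
    s_next = o_next @ W_target                    # EMA target; treated as a
                                                  # constant (stop-gradient)
    s_pred = np.concatenate([s_t, a_t]) @ W_pred  # action-conditioned prediction
    return np.mean((s_pred - s_next) ** 2)        # latent l2 loss

def ema_update():
    # Target weights track the online weights with momentum TAU;
    # no gradient ever flows into W_target.
    global W_target
    W_target = TAU * W_target + (1 - TAU) * W_online

o_t = rng.standard_normal(D_OBS)
a_t = rng.standard_normal(D_ACT)
o_next = rng.standard_normal(D_OBS)

loss = latent_loss(o_t, a_t, o_next)  # would be backpropagated into
ema_update()                          # W_online and W_pred only
```

Dropping `a_t` from the `concatenate` call recovers a passive V-JEPA-style next-step predictor, which is exactly the "no action conditioning" ablation above.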
Connections to Related Work Outside JEPA
ACT-JEPA also connects to several non-JEPA lines of work:
- World models (Ha & Schmidhuber, 2018; Hafner et al., 2019–2023): ACT-JEPA can be viewed as a world model that operates directly in a learned representation space. Unlike Dreamer-family models, it requires no decoder and never reconstructs observations.
- Forward-backward representations (Touati et al., 2023): Both learn state representations via predictive objectives, but ACT-JEPA uses a non-contrastive JEPA loss rather than a contrastive or successor-feature objective.
- BYOL / VICReg in RL (Schwarzer et al., 2021; Bardes et al., 2022): Self-supervised representation learning has been applied to RL via methods like SPR and VICReg. ACT-JEPA differs by conditioning on actions and using the JEPA prediction framework (predict next-state representation) rather than contrastive or variance-invariance-covariance objectives.
- Behavioral cloning with pretrained representations (Nair et al., 2022): ACT-JEPA provides a principled pretraining objective specifically designed to learn action-relevant features, rather than using general-purpose pretrained vision models (CLIP, R3M, etc.).
Influence and Future Directions
ACT-JEPA establishes a template for extending JEPA to any domain where an agent's actions determine future states. Natural extensions include:
- Multi-modal ACT-JEPA: Combining visual (camera), tactile (force/torque), and proprioceptive (joint state) observations in a multi-encoder architecture, predicting joint future representations.
- Hierarchical ACT-JEPA: Learning at multiple temporal scales—predicting the next state for fine-grained actions, and predicting states several steps ahead for high-level plans (connecting to H-JEPA concepts).
- ACT-JEPA for RL: Using the learned forward model for imagination-based planning or as a representation for model-free RL, extending beyond behavioral cloning.
11. Summary
Key Takeaway
ACT-JEPA extends the Joint-Embedding Predictive Architecture to robotic manipulation by introducing action conditioning into the latent prediction process. Given the current observation and the agent's action, ACT-JEPA predicts the latent representation of the next observation—learning a forward model in representation space rather than pixel space. This yields representations that are (1) action-relevant, capturing the causal structure of manipulation tasks, (2) noise-invariant, filtering sensor noise by design rather than by post-hoc robustification, and (3) data-efficient, enabling effective policy learning from as few as 10 human demonstrations.
Main Contribution
ACT-JEPA is the first JEPA variant that operates in the embodied, action-conditioned setting, demonstrating that the JEPA recipe—EMA target encoder, latent $\ell_2$ loss, stop-gradient, predictor bottleneck—transfers effectively from passive video understanding (V-JEPA) to active robotic control. The action-conditioned predictor is the key architectural innovation, transforming a passive perceptual backbone into a world model suitable for policy learning. Experiments show consistent advantages over pixel-based, autoencoder-based, contrastive, and passive JEPA representations, with particularly strong gains under sensor noise and in low-data regimes. ACT-JEPA opens the JEPA paradigm to the broader fields of robotics, embodied AI, and model-based reinforcement learning.
12. References
- Vujinovic, M., & Kovacevic, B. (2025). ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning. arXiv preprint arXiv:2501.14622.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ECCV 2024.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020.
- Ha, D., & Schmidhuber, J. (2018). World Models. arXiv preprint arXiv:1803.10122.
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). Learning Latent Dynamics for Planning from Pixels. ICML 2019.
- Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models. ICLR 2021.
- Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv preprint arXiv:2301.04104.
- Schwarzer, M., Anand, A., Goel, R., Hjelm, R. D., Courville, A., & Bachman, P. (2021). Data-Efficient Reinforcement Learning with Self-Predictive Representations. ICLR 2021.
- Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
- Nair, S., Rajeswaran, A., Kumar, V., Finn, C., & Gupta, A. (2022). R3M: A Universal Visual Representation for Robot Manipulation. CoRL 2022.
- Touati, A., Rapin, J., & Ollivier, Y. (2023). Does Zero-Shot Reinforcement Learning Exist? ICLR 2023.
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020.
- He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020.