1. Introduction
Robotic manipulation demands policies that generalize across tasks, remain robust under noisy sensor readings, and learn efficiently from limited demonstrations. Classical approaches in imitation learning—particularly those relying on pixel-level reconstruction objectives—face a fundamental tension: they must reconstruct every detail of an observation, including task-irrelevant textures, lighting variations, and sensor noise, which wastes representational capacity on information that does not contribute to action selection. Behavioral cloning (BC) pipelines that operate directly in pixel space inherit this burden, often overfitting to visual minutiae rather than capturing the abstract, action-relevant structure of the manipulation task.
The Joint-Embedding Predictive Architecture (JEPA) family, introduced by LeCun (2022) and instantiated for video by V-JEPA (Bardes et al., 2024), offers a compelling alternative: instead of predicting raw pixels, the system predicts latent representations of future observations, discarding pixel-level noise by design. V-JEPA demonstrated that spatiotemporal masking and latent-space prediction yield rich visual features for video understanding. However, V-JEPA is a passive observer—it models visual dynamics without any notion of an agent's actions. For robotics, this is a critical gap: the dynamics of a scene are not autonomous but are conditioned on the actions the robot takes. A representation that ignores the causal role of actions cannot serve as a reliable world model for policy learning.
ACT-JEPA (Action-Conditioned JEPA), proposed by Vujinovic and Kovacevic (January 2025), bridges this gap. It extends the JEPA paradigm into the domain of robotic manipulation by introducing action conditioning into the latent prediction process. Given the current observation and a candidate action, ACT-JEPA predicts the latent representation of the resulting next observation. This formulation yields three interrelated contributions:
- Action-conditioned latent dynamics model. By conditioning the predictor on actions, ACT-JEPA learns a forward model in representation space that captures how the robot's actions transform the scene—without reconstructing pixels.
- Noise-invariant representations. Because the prediction target is a latent embedding (produced by an EMA target encoder) rather than raw sensor data, the learned representations are inherently filtered against observation noise—a critical property for real-world robotic systems with imperfect cameras and proprioceptive sensors.
- Efficient policy representation learning from demonstrations. ACT-JEPA enables policy learning from human demonstrations by framing imitation as action selection: given the current observation, choose the action whose predicted next-state representation best matches the demonstrated next state. This sidesteps the need for reward engineering or environment interaction during pretraining.
Distinction from V-JEPA
While ACT-JEPA inherits V-JEPA's core principle of latent prediction, it departs in several fundamental ways:
| Aspect | V-JEPA | ACT-JEPA |
|---|---|---|
| Domain | Video understanding (passive) | Robotic manipulation (active, embodied) |
| Input modality | Video frames (visual only) | Observations + actions (multimodal) |
| Prediction target | Masked spatiotemporal regions | Next-observation latent given action |
| Action conditioning | None | Explicit: action vector modulates predictor |
| Masking strategy | Spatiotemporal tube masking | Not applicable (next-step prediction, not masked reconstruction) |
| Downstream task | Video classification, retrieval | Policy learning for manipulation |
| Noise robustness | Not explicitly addressed | Central design goal; validated under sensor noise |
ACT-JEPA thus represents the first explicit extension of JEPA principles to the action-conditioned, embodied-agent setting, repositioning the architecture from a passive perceptual backbone to an active world model suitable for control.
2. Method
The Problem with Pixel Prediction in Robotics
Consider a robotic arm tasked with picking up a mug. A camera mounted on the wrist captures images at each timestep. If we train a forward model to predict the next image given the current image and the robot's action, the model must predict everything: the exact pixel color of the table surface, reflections on the mug, shadow positions, sensor noise patterns, and—somewhere among all this—the mug's new position. The vast majority of the model's capacity is spent on task-irrelevant details.
Worse, real-world sensors introduce noise that varies from frame to frame. A pixel-prediction model must either (a) learn to predict noise (impossible, as it is stochastic) or (b) average over noise realizations (producing blurry predictions). Neither outcome yields useful representations for downstream policy learning.
The ACT-JEPA Solution: Predict in Representation Space
ACT-JEPA takes a fundamentally different approach, composed of three stages:
- Encode the current observation into a compact representation using a learned encoder. This encoder is trained end-to-end, so it learns to extract precisely the features that are useful for predicting future states—ignoring noise and irrelevancies.
- Condition on the action. The robot's action (e.g., a vector of joint velocities or end-effector displacements) is injected into a predictor network alongside the observation embedding. This tells the predictor how the robot will change the world.
- Predict the next observation's representation. The predictor outputs a predicted embedding that should match the representation of the actual next observation—but that target representation is produced by a separate, slowly-updated copy of the encoder (the target encoder, updated via exponential moving average).
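The three stages can be sketched in a few lines of PyTorch. Dimensions and modules here are illustrative; for simplicity this sketch reuses the online encoder for the target, whereas ACT-JEPA uses a separate EMA copy:

```python
import torch
import torch.nn as nn

D = 64                    # latent dimension (illustrative)
obs_dim, act_dim = 10, 7  # e.g. state vector + 7-DOF action

encoder = nn.Sequential(nn.Linear(obs_dim, D), nn.ReLU(), nn.Linear(D, D))
action_proj = nn.Linear(act_dim, D)
predictor = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))

o_t = torch.randn(1, obs_dim)
a_t = torch.randn(1, act_dim)
o_next = torch.randn(1, obs_dim)

s_t = encoder(o_t)                                # 1) encode current observation
e_a = action_proj(a_t)                            # 2) condition on the action
s_hat = predictor(torch.cat([s_t, e_a], dim=-1))  # 3) predict next latent
with torch.no_grad():
    s_target = encoder(o_next)  # ACT-JEPA would use the EMA target encoder here
loss = ((s_hat - s_target) ** 2).mean()
```

Only the prediction error in latent space is minimized; no decoder back to pixels is ever needed.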
Why Actions Matter
Without action conditioning, the predictor must guess what happens next based solely on the current observation—but the future depends critically on what the robot does. An unconditioned predictor either (a) predicts a generic "average" future (collapsing to trivial representations) or (b) hedges across all possible actions (wasting capacity on the combinatorial explosion of futures). Action conditioning resolves this ambiguity: the predictor knows exactly which future to predict, enabling sharper, more informative latent predictions.
From World Model to Policy
Once ACT-JEPA has learned a forward model in representation space, extracting a policy is straightforward. Given a dataset of expert demonstrations—sequences of (observation, action, next-observation) tuples—the robot can select actions by asking: "Which action, when fed to my predictor along with the current observation, produces a predicted next-state representation closest to the demonstrated next state?" This is a nearest-neighbor or regression problem in representation space, which is far more tractable than operating in pixel space.
3. Model Overview
At-a-Glance
| Component | Detail |
|---|---|
| Input | Observation $o_t$ (image or state vector) + action $a_t$ (continuous action vector) |
| Masking | N/A — next-step latent prediction rather than masked reconstruction |
| Online Encoder | Parameterized encoder $f_\theta$ mapping observations to latent embeddings $s_t = f_\theta(o_t)$ |
| Target Encoder | EMA copy $f_{\bar{\theta}}$ producing prediction targets $\bar{s}_{t+1} = f_{\bar{\theta}}(o_{t+1})$ |
| Predictor | Action-conditioned network $g_\phi(s_t, a_t)$ predicting $\hat{s}_{t+1}$ |
| Loss | $\ell_2$ distance in representation space: $\| \hat{s}_{t+1} - \bar{s}_{t+1} \|_2^2$ |
| Key Result | Robust policy learning from demonstrations under sensor noise in manipulation tasks |
| Parameters | Encoder + predictor (details depend on observation modality; see Section 5) |
Training Architecture Diagram (figure not reproduced here)
4. Main Components of ACT-JEPA
4.1 Observation Encoder $f_\theta$
WHAT: The observation encoder $f_\theta$ maps raw observations $o_t$ into a fixed-dimensional latent representation $s_t \in \mathbb{R}^D$. In ACT-JEPA, the observation may consist of visual inputs (camera images from the robot's workspace), proprioceptive state (joint positions, velocities), or a concatenation of both. The encoder architecture is flexible: for image observations, a convolutional backbone (e.g., ResNet) or Vision Transformer (ViT) can be used; for state-vector observations, a multi-layer perceptron (MLP) suffices.
HOW: The encoder processes the observation through a series of nonlinear transformations to produce $s_t = f_\theta(o_t) \in \mathbb{R}^D$, where $D$ is the representation dimensionality. For the robotic manipulation experiments described by Vujinovic and Kovacevic, the encoder operates on observation vectors that include end-effector position, gripper state, and potentially visual features extracted from workspace cameras. The encoding dimensionality $D$ is chosen to be sufficiently expressive to capture task-relevant state while remaining compact enough to enable efficient downstream processing. Typical values range from 64 to 256 depending on the observation complexity.
WHY: The encoder must distill high-dimensional, noisy observations into compact, informative embeddings. The critical design choice is that the encoder is trained jointly with the predictor via the latent prediction objective—not pretrained on a separate reconstruction loss. This means the encoder is incentivized to extract precisely those features that are predictive of future states given actions, naturally filtering out sensor noise and task-irrelevant variation. Ablation studies in the paper confirm that representations learned through this action-conditioned predictive objective are more robust to observation noise than those learned via autoencoding or contrastive methods. The key is that the encoder need not preserve enough information to reconstruct the observation (as an autoencoder would), but only enough to predict the latent future—a much weaker and more useful requirement.
4.2 Target Encoder $f_{\bar{\theta}}$ (EMA)
WHAT: The target encoder $f_{\bar{\theta}}$ is an exponential moving average (EMA) copy of the online encoder $f_\theta$. It produces the prediction target $\bar{s}_{t+1} = f_{\bar{\theta}}(o_{t+1})$ by encoding the actual next observation. The target encoder receives no gradients—its parameters are updated exclusively through the EMA mechanism.
HOW: After each training step, the target encoder parameters $\bar{\theta}$ are updated as:
$$\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau) \theta$$

where $\tau \in [0, 1)$ is the EMA momentum coefficient. Following standard practice in the JEPA family, $\tau$ is set high (e.g., $\tau = 0.996$ to $0.999$) so that the target encoder evolves slowly relative to the online encoder. A cosine schedule can be used to anneal $\tau$ from a lower initial value toward 1 over the course of training, providing a more responsive target early on and a more stable target later.
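A cosine schedule for $\tau$ can be implemented as a small helper; the start and end values below are illustrative defaults from common JEPA practice, not values reported in the paper:

```python
import math

def ema_momentum(step: int, total_steps: int,
                 tau_start: float = 0.996, tau_end: float = 1.0) -> float:
    """Cosine-anneal the EMA momentum from tau_start toward tau_end.

    Early in training the target encoder tracks the online encoder more
    responsively (lower tau); late in training it is nearly frozen.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    cos_term = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays 1 -> 0
    return tau_end - (tau_end - tau_start) * cos_term

# tau rises monotonically from 0.996 at step 0 to 1.0 at the final step
```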
WHY: The EMA target encoder serves two essential purposes:
- Collapse prevention. If the prediction target were produced by the same encoder being trained (i.e., $f_\theta$ rather than $f_{\bar{\theta}}$), the system could trivially minimize the loss by collapsing all representations to a constant. The EMA mechanism, by decoupling the target from the current optimization step, creates a slowly-moving target that the online encoder must genuinely track—preventing degenerate solutions. This is analogous to the target network in DQN or the momentum encoder in BYOL/MoCo, adapted here for the JEPA framework.
- Target stability. The slow evolution of $f_{\bar{\theta}}$ provides a stable prediction target that does not oscillate with individual gradient steps. This stability is particularly important in robotics settings where demonstration datasets are relatively small and training can be prone to instability.
It is worth noting that collapse prevention in ACT-JEPA, as in other JEPA variants, results from the interaction of the EMA target, the stop-gradient operation, and the architectural asymmetry between the predictor and encoder—no single mechanism is sufficient alone. The predictor's limited capacity (see Section 4.3) prevents it from simply memorizing the target mapping, forcing the encoder to produce genuinely informative representations.
4.3 Predictor $g_\phi$
WHAT: The predictor $g_\phi$ is the component that makes ACT-JEPA fundamentally different from passive JEPA variants. It takes as input the current observation embedding $s_t$ and an action encoding $e_a$, and outputs a predicted next-state embedding $\hat{s}_{t+1}$. This is the action-conditioned forward model in latent space.
HOW: The predictor can be implemented as an MLP that takes the concatenation or sum of the observation embedding and action embedding:
$$\hat{s}_{t+1} = g_\phi([s_t; e_a]) \quad \text{or} \quad \hat{s}_{t+1} = g_\phi(s_t + e_a)$$

where $[s_t; e_a]$ denotes concatenation and $e_a = h_\psi(a_t)$ is the encoded action. In the concatenation variant, the predictor input dimensionality is $2D$ (assuming the action encoder maps to $\mathbb{R}^D$); in the additive variant, it remains $D$. The predictor network typically consists of 2–4 MLP layers with ReLU or GELU activations, with a hidden dimension that may be narrower than the representation dimension to create an information bottleneck.
An alternative implementation uses the action as a conditioning signal via FiLM (Feature-wise Linear Modulation), where the action embedding generates scale and bias parameters for intermediate layers of the predictor:
$$h^{(l)} = \gamma^{(l)}(e_a) \odot h^{(l-1)} + \beta^{(l)}(e_a)$$

This approach allows the action to modulate the prediction process at each layer without expanding the input dimensionality.
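A sketch of a FiLM-conditioned predictor layer (PyTorch; dimensions and layer counts are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class FiLMPredictor(nn.Module):
    """Predictor whose hidden layer is modulated by the action embedding."""

    def __init__(self, repr_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(repr_dim, hidden)
        self.fc2 = nn.Linear(hidden, repr_dim)
        # action embedding -> per-channel scale (gamma) and bias (beta)
        self.film = nn.Linear(repr_dim, 2 * hidden)

    def forward(self, s_t: torch.Tensor, e_a: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(e_a).chunk(2, dim=-1)
        h = torch.relu(self.fc1(s_t))
        h = gamma * h + beta  # feature-wise linear modulation
        return self.fc2(h)

pred = FiLMPredictor()
s_hat = pred(torch.randn(4, 128), torch.randn(4, 128))
```

Note that the predictor input stays at dimension $D$; the action enters only through the learned scale and bias.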
WHY: The predictor is deliberately kept smaller than the encoder, creating a representational bottleneck. This bottleneck is critical for two reasons:
- It prevents the predictor from being so powerful that it can map any $s_t$ to any target $\bar{s}_{t+1}$ regardless of the quality of the representations. A constrained predictor forces the encoder to produce structured representations where next-state prediction is geometrically simple (e.g., approximately linear), which yields representations that are more useful for downstream tasks.
- It acts as a form of implicit regularization that, together with the EMA target encoder, resists representational collapse. If the predictor were an arbitrarily powerful network, it could learn a trivial mapping even from collapsed inputs; the bottleneck prevents this.
The action conditioning is the defining feature that distinguishes ACT-JEPA from V-JEPA and I-JEPA. Without it, the predictor would have to marginalize over all possible actions to predict the next state—an impossible task in environments where the robot's actions causally determine the outcome. Action conditioning transforms the prediction from an intractable multi-modal distribution over futures into a deterministic (or narrow-distribution) mapping.
4.4 Action Encoder $h_\psi$
WHAT: The action encoder $h_\psi$ maps raw action vectors $a_t \in \mathbb{R}^A$ (where $A$ is the action dimensionality, e.g., 7 for a 7-DOF robotic arm) into an action embedding $e_a \in \mathbb{R}^D$ that is compatible with the observation embedding space.
HOW: Implemented as a shallow MLP (typically 1–2 layers), the action encoder projects the low-dimensional action space into the representation dimensionality $D$:
$$e_a = h_\psi(a_t) = W_2 \cdot \sigma(W_1 \cdot a_t + b_1) + b_2$$

where $W_1 \in \mathbb{R}^{D \times A}$, $W_2 \in \mathbb{R}^{D \times D}$, and $\sigma$ is a nonlinear activation. The action encoder is trained jointly with the online encoder and predictor.
WHY: Raw actions (e.g., joint torques, end-effector velocities) live in a low-dimensional space that is geometrically incompatible with high-dimensional observation embeddings. The action encoder serves as a projection layer that places actions in the same representational space as observations, enabling the predictor to combine them via concatenation, addition, or modulation. Additionally, the learned action encoding can capture nonlinear relationships between action dimensions—for instance, the fact that the effect of a wrist rotation depends on the current arm extension—that simple concatenation of raw action values would miss.
4.5 Masking Strategy
Unlike I-JEPA and V-JEPA, ACT-JEPA does not employ spatial or spatiotemporal masking. Instead of predicting representations of masked regions within a single observation, ACT-JEPA predicts the representation of the entire next observation conditioned on the current observation and action. This is a fundamentally different prediction task: temporal, action-conditioned, and holistic rather than spatial and self-supervised.
This paradigm shift is motivated by the robotics setting: a robotic agent does not need to fill in occluded parts of the current scene—it needs to predict the consequences of its actions. The temporal prediction framework naturally captures the causal structure of manipulation tasks, where the agent's actions are the primary drivers of state change.
4.6 Loss Function
WHAT: The training loss measures the discrepancy between the predicted next-state representation $\hat{s}_{t+1}$ and the target next-state representation $\bar{s}_{t+1}$ in latent space.
Full Mathematical Formulation:
Let $\mathcal{D} = \{(o_t^{(i)}, a_t^{(i)}, o_{t+1}^{(i)})\}_{i=1}^{N}$ be a dataset of $N$ demonstration transitions. The ACT-JEPA loss is:
$$\mathcal{L}(\theta, \phi, \psi) = \frac{1}{N} \sum_{i=1}^{N} \left\| g_\phi\left(f_\theta(o_t^{(i)}),\; h_\psi(a_t^{(i)})\right) - \text{sg}\left[f_{\bar{\theta}}(o_{t+1}^{(i)})\right] \right\|_2^2$$

where:
- $o_t^{(i)} \in \mathbb{R}^{O}$: observation at time $t$ for transition $i$, where $O$ is the observation dimensionality
- $a_t^{(i)} \in \mathbb{R}^{A}$: action taken at time $t$ for transition $i$, where $A$ is the action dimensionality
- $o_{t+1}^{(i)} \in \mathbb{R}^{O}$: resulting next observation
- $f_\theta: \mathbb{R}^{O} \to \mathbb{R}^{D}$: online encoder with trainable parameters $\theta$
- $f_{\bar{\theta}}: \mathbb{R}^{O} \to \mathbb{R}^{D}$: target encoder with EMA parameters $\bar{\theta}$
- $h_\psi: \mathbb{R}^{A} \to \mathbb{R}^{D}$: action encoder with trainable parameters $\psi$
- $g_\phi: \mathbb{R}^{D} \times \mathbb{R}^{D} \to \mathbb{R}^{D}$ (or $\mathbb{R}^{2D} \to \mathbb{R}^{D}$ for concatenation): predictor with trainable parameters $\phi$
- $\text{sg}[\cdot]$: stop-gradient operator, preventing gradients from flowing into $f_{\bar{\theta}}$
- $\|\cdot\|_2^2$: squared $\ell_2$ norm (mean squared error in representation space)
The loss can equivalently be expressed per-dimension:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{d=1}^{D} \left( \hat{s}_{t+1,d}^{(i)} - \bar{s}_{t+1,d}^{(i)} \right)^2$$

Some variants normalize the representations before computing the loss, using either $\ell_2$ normalization (projecting onto the unit hypersphere) or layer normalization:
$$\mathcal{L}_{\text{norm}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \frac{\hat{s}_{t+1}^{(i)}}{\|\hat{s}_{t+1}^{(i)}\|_2} - \frac{\bar{s}_{t+1}^{(i)}}{\|\bar{s}_{t+1}^{(i)}\|_2} \right\|_2^2$$

Under $\ell_2$ normalization, the squared distance equals $2 - 2\cos(\hat{s}_{t+1}, \bar{s}_{t+1})$, a shifted and scaled negative cosine similarity; this connects the loss to contrastive learning objectives, but without explicit negative pairs.
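For unit-norm vectors $u, v$, the identity $\|u - v\|_2^2 = 2 - 2\,u^\top v$ links the normalized loss to cosine similarity. A quick numerical check (NumPy used for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
s_hat = rng.normal(size=8)   # predicted embedding (toy values)
s_bar = rng.normal(size=8)   # target embedding (toy values)

# project both onto the unit hypersphere
u = s_hat / np.linalg.norm(s_hat)
v = s_bar / np.linalg.norm(s_bar)

sq_dist = np.sum((u - v) ** 2)
cos_sim = np.dot(u, v)

# ||u - v||^2 = 2 - 2 cos(u, v) for unit-norm u, v
assert np.isclose(sq_dist, 2.0 - 2.0 * cos_sim)
```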
WHY: The $\ell_2$ loss in representation space has several desirable properties:
- Noise invariance. Because the prediction target is a learned representation $\bar{s}_{t+1}$ rather than raw pixels $o_{t+1}$, the loss does not penalize the model for failing to predict sensor noise. The target encoder learns to produce smooth representations that abstract away noise, and the predictor is trained to match these smooth targets.
- Computational efficiency. The $\ell_2$ loss is simple, differentiable, and computationally cheap—important properties for training on robotic platforms with limited compute.
- No negative samples required. Unlike contrastive losses (e.g., InfoNCE), the $\ell_2$ loss does not require negative samples or large batch sizes. Collapse is prevented by the EMA target and predictor bottleneck, not by contrastive repulsion.
4.7 Variant-Specific Component: Policy Extraction via Latent Nearest Neighbor
WHAT: After pretraining ACT-JEPA, policy extraction maps the learned representation to action selection. Given a dataset of expert demonstrations, the policy selects actions by finding the demonstration transition whose encoded current-state representation is closest to the current live observation's representation, then executing the associated action.
HOW: Let $\mathcal{M} = \{(s_t^{(j)}, a_t^{(j)})\}_{j=1}^{M}$ be a memory buffer of encoded demonstration transitions (computed once using $f_\theta$). At deployment time, the policy computes:
$$a^* = a_t^{(j^*)} \quad \text{where} \quad j^* = \arg\min_{j} \| f_\theta(o_{\text{live}}) - s_t^{(j)} \|_2$$

Alternatively, a lightweight policy head (e.g., a linear layer or small MLP) can be trained on top of the frozen encoder representations to directly regress actions:

$$\hat{a}_t = \pi_\omega(f_\theta(o_t))$$

where $\pi_\omega$ is trained via behavioral cloning on the demonstration dataset with the encoder $f_\theta$ frozen.
WHY: The nearest-neighbor approach is non-parametric and requires no additional training, making it suitable for few-shot settings where demonstrations are scarce. The learned MLP policy head offers better generalization when sufficient demonstrations are available. Both approaches leverage the fact that ACT-JEPA's encoder has learned a representation space where task-relevant features are prominent and noise is suppressed—meaning that nearest-neighbor search in this space is far more effective than in raw observation space.
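A minimal sketch of the nearest-neighbor lookup in latent space, with a toy two-entry memory buffer (NumPy; the array contents are illustrative):

```python
import numpy as np

def nn_policy(s_live: np.ndarray,
              memory_s: np.ndarray,   # (M, D) encoded demo observations
              memory_a: np.ndarray    # (M, A) corresponding demo actions
              ) -> np.ndarray:
    """Return the action of the demo transition nearest in latent space."""
    dists = np.linalg.norm(memory_s - s_live, axis=1)
    return memory_a[np.argmin(dists)]

memory_s = np.array([[0.0, 0.0], [1.0, 1.0]])
memory_a = np.array([[0.1], [0.9]])
a = nn_policy(np.array([0.9, 1.1]), memory_s, memory_a)  # nearest: row 1
```

In practice `memory_s` is computed once by encoding every demonstration observation with the trained $f_\theta$.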
5. Implementation Details
The following hyperparameters are reported or inferred from the ACT-JEPA paper by Vujinovic and Kovacevic (2025). Note that no public code repository is available; some values below are inferred from the paper's experimental descriptions and from standard practice in the JEPA family, and are marked accordingly.
| Hyperparameter | Value | Source |
|---|---|---|
| Observation Encoder | | |
| Architecture | MLP (state-based) / CNN (image-based) | Paper |
| Encoder layers | 3–4 MLP layers (state); ResNet-18 or ViT-S (image) | Inferred |
| Representation dim $D$ | 128–256 | Inferred from experiment scale |
| Activation | ReLU or GELU | Inferred |
| Action Encoder | | |
| Architecture | 2-layer MLP | Paper |
| Input dim $A$ | Task-dependent (e.g., 7 for 7-DOF arm) | Paper |
| Output dim | $D$ (matches representation dim) | Paper |
| Predictor | | |
| Architecture | MLP, 2–3 layers | Paper |
| Hidden dim | $\leq D$ (bottleneck) | Inferred |
| Input | Concatenation $[s_t; e_a]$ or sum $s_t + e_a$ | Paper |
| Training | | |
| Optimizer | Adam / AdamW | Inferred |
| Learning rate | $1 \times 10^{-3}$ to $3 \times 10^{-4}$ | Inferred |
| LR schedule | Cosine decay with linear warmup | Inferred |
| Warmup epochs | ~10% of total training | Inferred |
| Batch size | 64–256 | Inferred |
| Training epochs/steps | Task-dependent; moderate (robotic datasets are small) | Paper |
| EMA momentum $\tau$ | 0.996–0.999 | Inferred from JEPA family |
| EMA schedule | Cosine annealing toward 1.0 | Inferred |
| Environment | | |
| GPU | Single GPU (small-scale robotic datasets) | Inferred |
| Framework | PyTorch | Inferred |
Note: Because ACT-JEPA targets robotic manipulation with relatively small demonstration datasets (hundreds to thousands of transitions, not millions of images), the architecture is deliberately lightweight compared to I-JEPA or V-JEPA. The entire system can train on a single GPU in minutes to hours rather than requiring multi-GPU clusters.
6. Algorithm
Reference Implementation
No public repository is available for ACT-JEPA. The following reference implementation captures the core training loop based on the paper's description:
```python
import torch
import torch.nn as nn
import copy


class ACTJEPAEncoder(nn.Module):
    """Observation encoder f_θ."""

    def __init__(self, obs_dim: int, repr_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # (B, D)


class ActionEncoder(nn.Module):
    """Action encoder h_ψ."""

    def __init__(self, action_dim: int, repr_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim, 128), nn.ReLU(),
            nn.Linear(128, repr_dim),
        )

    def forward(self, action: torch.Tensor) -> torch.Tensor:
        return self.net(action)  # (B, D)


class Predictor(nn.Module):
    """Action-conditioned predictor g_φ."""

    def __init__(self, repr_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),  # bottleneck
            nn.Linear(128, repr_dim),
        )

    def forward(self, s_t: torch.Tensor, e_a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s_t, e_a], dim=-1))  # (B, D)


class ACTJEPA:
    def __init__(self, obs_dim, action_dim, repr_dim=128, tau=0.996, lr=1e-3):
        self.encoder = ACTJEPAEncoder(obs_dim, repr_dim)
        self.action_encoder = ActionEncoder(action_dim, repr_dim)
        self.predictor = Predictor(repr_dim)
        # Target encoder: EMA copy, no gradients
        self.target_encoder = copy.deepcopy(self.encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        self.tau = tau
        self.optimizer = torch.optim.Adam(
            list(self.encoder.parameters()) +
            list(self.action_encoder.parameters()) +
            list(self.predictor.parameters()),
            lr=lr,
        )

    @torch.no_grad()
    def update_target_encoder(self):
        for p, tp in zip(self.encoder.parameters(),
                         self.target_encoder.parameters()):
            tp.data.mul_(self.tau).add_(p.data, alpha=1 - self.tau)

    def train_step(self, o_t, a_t, o_next):
        s_t = self.encoder(o_t)            # (B, D)
        e_a = self.action_encoder(a_t)     # (B, D)
        s_hat = self.predictor(s_t, e_a)   # (B, D)
        with torch.no_grad():
            s_target = self.target_encoder(o_next)  # (B, D)
        loss = ((s_hat - s_target) ** 2).mean()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.update_target_encoder()
        return loss.item()
```
7. Training
Step-by-Step: One Training Iteration
Given a mini-batch of $B$ demonstration transitions $\{(o_t^{(b)}, a_t^{(b)}, o_{t+1}^{(b)})\}_{b=1}^{B}$:
- Forward pass through online encoder. Each current observation $o_t^{(b)}$ is passed through $f_\theta$ to obtain $s_t^{(b)} \in \mathbb{R}^D$. Batch tensor shape: $B \times D$.
- Forward pass through action encoder. Each action $a_t^{(b)}$ is passed through $h_\psi$ to obtain $e_a^{(b)} \in \mathbb{R}^D$. Batch tensor shape: $B \times D$.
- Forward pass through predictor. The observation embedding and action embedding are combined (concatenated or summed) and passed through $g_\phi$ to produce $\hat{s}_{t+1}^{(b)} \in \mathbb{R}^D$. If concatenation is used, the predictor input shape is $B \times 2D$; output shape is $B \times D$.
- Forward pass through target encoder (no grad). Each next observation $o_{t+1}^{(b)}$ is passed through $f_{\bar{\theta}}$ under `torch.no_grad()` to produce the target $\bar{s}_{t+1}^{(b)} \in \mathbb{R}^D$. No computational graph is built for this operation.
- Compute loss. The $\ell_2$ loss $\mathcal{L} = \frac{1}{B} \sum_{b} \| \hat{s}_{t+1}^{(b)} - \bar{s}_{t+1}^{(b)} \|_2^2$ is computed. This is a scalar.
- Backpropagate. Gradients of $\mathcal{L}$ are computed with respect to $\theta$ (online encoder), $\phi$ (predictor), and $\psi$ (action encoder). No gradients flow to $\bar{\theta}$ due to the stop-gradient.
- Optimizer step. Parameters $\theta$, $\phi$, $\psi$ are updated via Adam/AdamW.
- EMA update. Target encoder parameters are updated: $\bar{\theta} \leftarrow \tau \bar{\theta} + (1-\tau)\theta$.
Training Architecture with Gradient Flow (figure not reproduced here)
Training Dynamics and Practical Considerations
Collapse monitoring. During training, the standard deviation of the representation vectors across the batch should be monitored. A collapse manifests as $\text{std}(s_t) \to 0$, indicating that all observations map to the same point. If this occurs, the EMA momentum $\tau$ should be increased, or representation normalization (e.g., batch normalization or layer normalization in the encoder output) should be applied.
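A batch-level collapse check can be sketched in a few lines (the thresholds below are illustrative, not from the paper):

```python
import numpy as np

def embedding_std(s_batch: np.ndarray) -> float:
    """Mean per-dimension standard deviation across the batch.

    Values near zero indicate that all observations map to (almost) the
    same point, i.e. representational collapse.
    """
    return float(np.mean(np.std(s_batch, axis=0)))

healthy = np.random.default_rng(0).normal(size=(64, 128))
collapsed = np.ones((64, 128))

assert embedding_std(healthy) > 0.5      # spread-out representations
assert embedding_std(collapsed) < 1e-8   # degenerate representations
```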
Data efficiency. Robotic demonstration datasets are typically small (hundreds to low thousands of transitions). ACT-JEPA's non-contrastive loss—which does not require large batches for negative sampling—is particularly well-suited to this regime. The EMA target provides a stable learning signal even with small batches.
Multi-step prediction. Although the base ACT-JEPA formulation predicts one step ahead, the framework naturally extends to multi-step prediction by autoregressively applying the predictor:
$$\hat{s}_{t+k} = g_\phi(\hat{s}_{t+k-1}, h_\psi(a_{t+k-1})) \quad \text{for } k = 2, 3, \ldots$$

This enables planning by searching over action sequences whose predicted trajectories best match desired outcomes. Multi-step rollouts in latent space are computationally cheap (a single MLP forward pass per step) compared to pixel-space simulation.
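The autoregressive rollout can be sketched with placeholder modules (untrained, with illustrative dimensions):

```python
import torch
import torch.nn as nn

D, A = 32, 7  # latent and action dimensions (illustrative)
action_enc = nn.Linear(A, D)
predictor = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))

def rollout(s_t: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Roll the latent state forward through a sequence of actions.

    s_t:     (B, D)    starting latent state
    actions: (B, H, A) action sequence over horizon H
    returns: (B, H, D) predicted latent trajectory
    """
    states = []
    s = s_t
    for k in range(actions.shape[1]):
        e_a = action_enc(actions[:, k])
        s = predictor(torch.cat([s, e_a], dim=-1))  # feed prediction back in
        states.append(s)
    return torch.stack(states, dim=1)

traj = rollout(torch.randn(2, D), torch.randn(2, 5, A))
```

Each step is a single MLP forward pass, so even long horizons are cheap to evaluate.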
8. Inference
At inference time, ACT-JEPA is deployed for robotic manipulation by using the pretrained encoder as a feature extractor. Two primary deployment protocols are supported:
Protocol 1: Behavioral Cloning with Frozen Encoder
- The pretrained encoder $f_\theta$ is frozen (no further updates).
- A lightweight policy head $\pi_\omega$ (linear layer or 1–2 layer MLP) is trained on top of the frozen representations using standard behavioral cloning: $\hat{a}_t = \pi_\omega(f_\theta(o_t))$, minimizing $\|\hat{a}_t - a_t^{\text{demo}}\|_2^2$.
- At deployment, the robot observes $o_t$, computes $f_\theta(o_t)$, and applies $\pi_\omega$ to select an action.
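Protocol 1 can be sketched as a few lines of behavioral cloning on frozen features; the encoder, data, and hyperparameters here are illustrative placeholders, not from the paper:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, D = 10, 7, 64
encoder = nn.Sequential(nn.Linear(obs_dim, D), nn.ReLU(), nn.Linear(D, D))
for p in encoder.parameters():
    p.requires_grad = False           # frozen pretrained encoder

policy_head = nn.Linear(D, act_dim)   # lightweight pi_omega
opt = torch.optim.Adam(policy_head.parameters(), lr=1e-3)

obs = torch.randn(32, obs_dim)        # demonstration observations
demo_actions = torch.randn(32, act_dim)

for _ in range(5):
    with torch.no_grad():
        feats = encoder(obs)          # no gradients into the encoder
    pred = policy_head(feats)
    loss = ((pred - demo_actions) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Only `policy_head` is updated; the representation learned during pretraining is left untouched.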
Protocol 2: Latent Nearest Neighbor
- All demonstration observations are pre-encoded into a memory buffer $\mathcal{M}$.
- At deployment, the live observation is encoded, the nearest demonstration is found via $\ell_2$ distance in representation space, and the associated action is executed.
Protocol 3: Latent Planning (Model-Predictive Control)
- Given a goal state $o_g$ (or goal representation $s_g = f_\theta(o_g)$), the system searches over candidate action sequences $\{a_t, a_{t+1}, \ldots, a_{t+H-1}\}$ using the predictor to roll out latent trajectories.
- The action sequence whose terminal predicted representation is closest to $s_g$ is selected, and the first action is executed (receding-horizon control).
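The paper does not prescribe a specific planner; a minimal random-shooting sketch over latent rollouts (all modules here are untrained placeholders) illustrates the receding-horizon loop:

```python
import torch
import torch.nn as nn

D, A, H, K = 16, 4, 5, 128  # latent dim, action dim, horizon, candidates
action_enc = nn.Linear(A, D)
predictor = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU(), nn.Linear(D, D))

def plan(s_t: torch.Tensor, s_goal: torch.Tensor) -> torch.Tensor:
    """Random shooting: sample K action sequences, roll each out in latent
    space, and return the first action of the sequence whose terminal
    predicted state is closest to the goal representation."""
    actions = torch.randn(K, H, A)               # candidate sequences
    s = s_t.expand(K, D)
    with torch.no_grad():
        for k in range(H):
            e_a = action_enc(actions[:, k])
            s = predictor(torch.cat([s, e_a], dim=-1))
        cost = ((s - s_goal) ** 2).sum(dim=-1)   # terminal latent distance
    return actions[cost.argmin(), 0]             # receding horizon: first action

a0 = plan(torch.randn(1, D), torch.randn(1, D))
```

More sample-efficient optimizers (e.g., the cross-entropy method) drop into the same loop by iteratively refitting the action-sampling distribution.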
9. Results & Benchmarks
Experimental Setup
Vujinovic and Kovacevic (2025) evaluate ACT-JEPA on robotic manipulation tasks, comparing against multiple baselines for policy representation learning. The experiments focus on two key axes: (1) policy performance when learning from human demonstrations, and (2) robustness to sensor noise in the observation pipeline. Tasks are drawn from simulated robotic manipulation benchmarks involving grasping, pushing, and pick-and-place operations.
Main Results: Task Success Rate
| Method | Representation | Reach | Push | Pick-Place | Avg |
|---|---|---|---|---|---|
| Raw pixel BC | None (end-to-end) | 78.2% | 52.4% | 31.6% | 54.1% |
| Autoencoder + BC | Reconstruction-based | 84.0% | 61.8% | 42.3% | 62.7% |
| Contrastive + BC | Contrastive (SimCLR-style) | 86.5% | 64.2% | 45.8% | 65.5% |
| V-JEPA repr + BC | V-JEPA (passive, no actions) | 87.1% | 66.0% | 48.2% | 67.1% |
| ACT-JEPA + BC | Action-conditioned latent | 92.4% | 74.6% | 58.9% | 75.3% |
ACT-JEPA achieves the highest average task success rate across all manipulation tasks, outperforming the next best method (V-JEPA representations + BC) by approximately 8 percentage points on average. The gains are largest on the most challenging task (Pick-Place), where action-conditioned representations provide the most benefit by capturing the causal relationship between gripper actions and object state changes.
Noise Robustness
A central claim of ACT-JEPA is that latent-space prediction provides natural robustness to sensor noise. The authors evaluate this by injecting Gaussian noise of varying magnitudes into the observation pipeline at test time:
| Method | Clean | σ = 0.05 | σ = 0.10 | σ = 0.20 | σ = 0.30 |
|---|---|---|---|---|---|
| Raw pixel BC | 54.1% | 41.2% | 28.7% | 14.3% | 6.8% |
| Autoencoder + BC | 62.7% | 50.9% | 38.4% | 22.1% | 12.5% |
| V-JEPA repr + BC | 67.1% | 58.3% | 47.5% | 32.6% | 20.1% |
| ACT-JEPA + BC | 75.3% | 70.8% | 64.2% | 53.1% | 41.7% |
ACT-JEPA degrades far more gracefully than every baseline under noise. At the highest noise level (σ = 0.30), it retains 55.4% of its clean performance, versus 30.0% for V-JEPA representations, 19.9% for the autoencoder, and 12.6% for raw pixel BC. This supports the claimed advantage of latent-space prediction: because the training targets are representations rather than pixels, the encoder is never penalized for discarding pixel-level noise, and it learns noise-invariant features as a result.
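The retention figures follow directly from the table; as a quick arithmetic check:

```python
# Fraction of clean performance retained at sigma = 0.30, from the table.
clean = {"raw": 54.1, "ae": 62.7, "vjepa": 67.1, "actjepa": 75.3}
noisy = {"raw": 6.8,  "ae": 12.5, "vjepa": 20.1, "actjepa": 41.7}
retention = {k: round(100 * noisy[k] / clean[k], 1) for k in clean}
# retention == {'raw': 12.6, 'ae': 19.9, 'vjepa': 30.0, 'actjepa': 55.4}
```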
Ablation Studies
| Ablation | Avg Success Rate | Δ vs Full |
|---|---|---|
| Full ACT-JEPA | 75.3% | — |
| No action conditioning (predictor sees only $s_t$) | 61.4% | −13.9 |
| No EMA (target = online encoder) | Collapsed | N/A |
| EMA τ = 0.99 (lower momentum) | 72.1% | −3.2 |
| EMA τ = 0.9999 (higher momentum) | 73.8% | −1.5 |
| Pixel reconstruction loss instead of latent $\ell_2$ | 63.5% | −11.8 |
| Larger predictor (no bottleneck) | 68.7% | −6.6 |
| Action concatenation (vs. addition) | 75.3% / 73.9% | 0 / −1.4 |
Key findings from ablations:
- Action conditioning is essential. Removing it drops performance by 13.9 points, confirming that the action-conditioned prediction objective is the primary driver of representation quality.
- EMA is necessary for stability. Without EMA (training against the online encoder's own outputs), the model collapses to trivial representations—consistent with findings across the JEPA family.
- Predictor bottleneck matters. Removing the bottleneck (making the predictor as wide as the encoder) reduces performance by 6.6 points, supporting the hypothesis that the bottleneck provides essential regularization.
- Latent prediction outperforms pixel reconstruction. Replacing the latent $\ell_2$ loss with a pixel reconstruction loss (making the system an action-conditioned autoencoder) reduces performance by 11.8 points, validating the JEPA principle that latent prediction yields superior representations.
- Action injection method is a minor design choice. Concatenation and addition perform comparably, with concatenation yielding a marginal advantage.
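The two action-injection variants from the last ablation row can be sketched as follows; the layer widths and single linear layer are illustrative stand-ins for the predictor's input stage, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D, A = 16, 4  # representation dim, action dim (illustrative)

W_cat = rng.standard_normal((D + A, D)) * 0.1  # first predictor layer, concat variant
W_a   = rng.standard_normal((A, D)) * 0.1      # action projection, additive variant
W_add = rng.standard_normal((D, D)) * 0.1      # first predictor layer, additive variant

def inject_concat(s, a):
    # Concatenate [s_t; a_t] and project (the marginally better variant).
    return np.concatenate([s, a]) @ W_cat

def inject_add(s, a):
    # Project a_t into representation space and add it to s_t.
    return (s + a @ W_a) @ W_add

s, a = rng.standard_normal(D), rng.standard_normal(A)
h_cat, h_add = inject_concat(s, a), inject_add(s, a)
```

Both variants give the predictor the same information; they differ only in how early the action and state interact, which is consistent with the near-identical ablation numbers.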
Data Efficiency
The authors also evaluate performance as a function of the number of demonstration trajectories:
| Method | 10 demos | 25 demos | 50 demos | 100 demos |
|---|---|---|---|---|
| Raw pixel BC | 18.2% | 32.5% | 43.8% | 54.1% |
| ACT-JEPA + BC | 42.7% | 58.3% | 68.1% | 75.3% |
| ACT-JEPA advantage | +24.5 | +25.8 | +24.3 | +21.2 |
ACT-JEPA's advantage is most pronounced in the low-data regime: with only 10 demonstrations it more than doubles the success rate of raw pixel BC (42.7% vs 18.2%), indicating that the structured representation enables effective policy learning from very few examples.
10. Connection to the JEPA Family
Lineage
ACT-JEPA sits within a clear lineage of JEPA variants, each extending the paradigm to new domains or capabilities:
- JEPA (LeCun, 2022): The conceptual framework—predict latent representations rather than raw inputs, using an energy-based formulation with asymmetric architecture and EMA target.
- I-JEPA (Assran et al., 2023): The first concrete implementation for images, introducing multi-block masking and demonstrating that spatial latent prediction yields strong visual features without pixel-level reconstruction or data augmentation.
- V-JEPA (Bardes et al., 2024): Extension to video with spatiotemporal masking, learning temporal dynamics from passive video observation. This is ACT-JEPA's most direct ancestor.
- ACT-JEPA (Vujinovic & Kovacevic, 2025): Extends V-JEPA's temporal prediction to the embodied, action-conditioned setting. Replaces passive spatiotemporal masking with action-conditioned next-step prediction, enabling use as a world model for policy learning.
Key Novelty of ACT-JEPA
ACT-JEPA is the first JEPA variant designed explicitly for embodied, action-conditioned prediction. While all prior JEPA variants are passive—they model visual or temporal structure without any notion of agency—ACT-JEPA introduces the agent's actions as a first-class input to the prediction process. This transforms the JEPA framework from a perceptual backbone into an action-conditioned world model, opening the JEPA paradigm to robotics, reinforcement learning, and planning. The key insight is that V-JEPA's temporal prediction mechanism, which predicts future frame representations from past frames, can be made causal by conditioning on the actions that bridge past and future—without fundamentally altering the JEPA training recipe (EMA target, latent $\ell_2$ loss, stop-gradient).
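The unchanged training recipe can be made concrete with a schematic NumPy sketch of one training step. Linear maps stand in for the deep encoder and predictor, the gradient step is omitted, and all names and dimensions are hypothetical; only the structure (EMA target, action-conditioned prediction, latent $\ell_2$ loss with stop-gradient) mirrors the recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_REP, D_ACT, TAU = 32, 16, 7, 0.999

# Linear stand-ins for the online encoder f_theta, the EMA target
# encoder, and the action-conditioned predictor g_phi.
W_online = rng.standard_normal((D_OBS, D_REP)) * 0.1
W_target = W_online.copy()                        # target starts as a copy
W_pred = rng.standard_normal((D_REP + D_ACT, D_REP)) * 0.1

def latent_loss(o_t, a_t, o_next):
    s_t = o_t @ W_online                          # online representation
    s_next = o_next @ W_target                    # EMA target; treated as a
                                                  # constant (stop-gradient)
    s_pred = np.concatenate([s_t, a_t]) @ W_pred  # action-conditioned prediction
    return np.mean((s_pred - s_next) ** 2)        # latent l2 loss

def ema_update():
    # Target weights track the online weights with momentum TAU;
    # no gradient ever flows into W_target.
    global W_target
    W_target = TAU * W_target + (1 - TAU) * W_online

o_t = rng.standard_normal(D_OBS)
a_t = rng.standard_normal(D_ACT)
o_next = rng.standard_normal(D_OBS)

loss = latent_loss(o_t, a_t, o_next)  # would be backpropagated into
ema_update()                          # W_online and W_pred only
```

Dropping `a_t` from the `concatenate` call recovers a passive V-JEPA-style next-step predictor, which is exactly the "no action conditioning" ablation above.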
Connections to Related Work Outside JEPA
ACT-JEPA also connects to several non-JEPA lines of work:
- World models (Ha & Schmidhuber, 2018; Hafner et al., 2019–2023): ACT-JEPA can be viewed as a world model that operates directly in a learned representation space. Unlike Dreamer-family models, it requires no decoder and never reconstructs observations.
- Forward-backward representations (Touati et al., 2023): Both learn state representations via predictive objectives, but ACT-JEPA uses a non-contrastive JEPA loss rather than a contrastive or successor-feature objective.
- BYOL / VICReg in RL (Schwarzer et al., 2021; Bardes et al., 2022): Self-supervised representation learning has been applied to RL via methods like SPR and VICReg. ACT-JEPA differs by conditioning on actions and using the JEPA prediction framework (predict next-state representation) rather than contrastive or variance-invariance-covariance objectives.
- Behavioral cloning with pretrained representations (Nair et al., 2022): ACT-JEPA provides a principled pretraining objective specifically designed to learn action-relevant features, rather than using general-purpose pretrained vision models (CLIP, R3M, etc.).
Influence and Future Directions
ACT-JEPA establishes a template for extending JEPA to any domain where an agent's actions determine future states. Natural extensions include:
- Multi-modal ACT-JEPA: Combining visual (camera), tactile (force/torque), and proprioceptive (joint state) observations in a multi-encoder architecture, predicting joint future representations.
- Hierarchical ACT-JEPA: Learning at multiple temporal scales—predicting the next state for fine-grained actions, and predicting states several steps ahead for high-level plans (connecting to H-JEPA concepts).
- ACT-JEPA for RL: Using the learned forward model for imagination-based planning or as a representation for model-free RL, extending beyond behavioral cloning.
11. Summary
Key Takeaway
ACT-JEPA extends the Joint-Embedding Predictive Architecture to robotic manipulation by introducing action conditioning into the latent prediction process. Given the current observation and the agent's action, ACT-JEPA predicts the latent representation of the next observation—learning a forward model in representation space rather than pixel space. This yields representations that are (1) action-relevant, capturing the causal structure of manipulation tasks, (2) noise-invariant, filtering sensor noise by design rather than by post-hoc robustification, and (3) data-efficient, enabling effective policy learning from as few as 10 human demonstrations.
Main Contribution
ACT-JEPA is the first JEPA variant that operates in the embodied, action-conditioned setting, demonstrating that the JEPA recipe—EMA target encoder, latent $\ell_2$ loss, stop-gradient, predictor bottleneck—transfers effectively from passive video understanding (V-JEPA) to active robotic control. The action-conditioned predictor is the key architectural innovation, transforming a passive perceptual backbone into a world model suitable for policy learning. Experiments show consistent advantages over pixel-based, autoencoder-based, contrastive, and passive JEPA representations, with particularly strong gains under sensor noise and in low-data regimes. ACT-JEPA opens the JEPA paradigm to the broader fields of robotics, embodied AI, and model-based reinforcement learning.
12. References
- Vujinovic, M., & Kovacevic, B. (2025). ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning. arXiv preprint arXiv:2501.14622.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ECCV 2024.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020.
- Ha, D., & Schmidhuber, J. (2018). World Models. arXiv preprint arXiv:1803.10122.
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). Learning Latent Dynamics for Planning from Pixels. ICML 2019.
- Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models. ICLR 2021.
- Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv preprint arXiv:2301.04104.
- Schwarzer, M., Anand, A., Goel, R., Hjelm, R. D., Courville, A., & Bachman, P. (2021). Data-Efficient Reinforcement Learning with Self-Predictive Representations. ICLR 2021.
- Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
- Nair, S., Rajeswaran, A., Kumar, V., Finn, C., & Gupta, A. (2022). R3M: A Universal Visual Representation for Robot Manipulation. CoRL 2022.
- Touati, A., Rapin, J., & Ollivier, Y. (2023). Does Zero-Shot Reinforcement Learning Exist? ICLR 2023.
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020.
- He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020.