Value-Guided Action Planning with JEPA World Models

At a glance

Problem: Plain model-predictive control over a JEPA world model scores rollouts by distance to a goal latent, a greedy metric that misjudges long horizons, dead-ends, and irreversible states.

Key idea: Learn a value function over latent states and let it guide action search, so short rollouts approximate long-horizon outcomes.

Modality: Vision (+ actions)

Target / masking: Standard JEPA latent prediction against a target encoder; value learned on top of frozen or co-trained latents.

Builds on: JEPA action-conditioned world models (V-JEPA 2) and value-based reinforcement learning.

Used for: Value-guided action planning with longer effective horizon and tractable search.

Motivation

A JEPA world model gives an agent imagined transitions, and the simplest way to use them is to sample action sequences, roll the latent forward, and pick the rollout whose final state is closest to a goal embedding. But Euclidean distance to a goal is a poor guide over long horizons: it cannot see around obstacles, distinguish recoverable from irreversible states, or recognise that a detour now enables progress later. Reinforcement learning has long addressed this with a learned value function that amortises the cost-to-go. This work brings that idea to JEPA planning, replacing greedy goal proximity with a learned notion of progress evaluated in the model's latent space.

How it works

Figure: Canonical JEPA schematic for video. The input is split into a visible context and hidden targets (tubelet-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ predicts the target embeddings from the context embeddings; training minimises the distance between predicted and target embeddings. Here the predictor is action-conditioned, $\hat z_{t+1}=g_\phi(z_t,a_t)$, which is what turns a representation learner into a world model.

The JEPA backbone is unchanged: a context encoder embeds observations into $z_t$ and a predictor advances the latent under actions, $\hat z_{t+1}=g_\phi(z_t,a_t)$, trained with a latent prediction loss against a target encoder plus an anti-collapse regulariser. On top of these latents a value function $V_\psi(z)$ is learned, estimating the expected long-horizon return (or negative cost-to-go) from a latent state.
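To make the interfaces concrete, here is a minimal sketch of a latent rollout followed by value evaluation. The names `rollout`, `toy_predictor`, and `toy_value` are hypothetical stand-ins for $g_\phi$ and $V_\psi$, not the paper's networks; the additive dynamics and quadratic value are assumptions chosen only so the example runs with the standard library.

```python
from typing import Callable, List, Sequence

Latent = List[float]
Action = List[float]


def rollout(z0: Latent, actions: Sequence[Action],
            predictor: Callable[[Latent, Action], Latent]) -> List[Latent]:
    """Return [z_t, z_hat_{t+1}, ..., z_hat_{t+H}] by repeatedly applying the predictor."""
    latents = [list(z0)]
    for a in actions:
        latents.append(predictor(latents[-1], a))
    return latents


# Hypothetical stand-ins for the learned networks (assumptions for illustration).
def toy_predictor(z: Latent, a: Action) -> Latent:
    # Additive dynamics: the action simply shifts the latent.
    return [zi + ai for zi, ai in zip(z, a)]


def toy_value(z: Latent) -> float:
    # Negative squared distance to the origin plays the role of negative cost-to-go.
    return -sum(zi * zi for zi in z)


latents = rollout([2.0, 0.0], [[-1.0, 0.0], [-1.0, 0.0]], toy_predictor)
values = [toy_value(z) for z in latents]
```

In a real system the predictor and value would be neural networks operating on encoder outputs; only the call pattern (roll forward, then score each predicted latent) carries over.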

Planning then optimises action sequences — via sampling, gradient descent, or tree search — to maximise the accumulated value of the predicted latent rollout rather than mere terminal goal proximity. Because $V_\psi$ summarises everything beyond the rollout horizon, a short, cheap rollout can stand in for a long-horizon decision, and the planner is steered away from dead-ends that look superficially close to the goal.
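A toy comparison may help show why the scoring rule matters. The sketch below is not from the paper: grid positions stand in for latents, a hand-written `step` function stands in for the predictor, and an exact breadth-first cost-to-go stands in for the learned $V_\psi$ (all assumptions). The agent starts in a pocket next to a wall, where the goal is close in Euclidean terms but can only be reached by first moving away.

```python
import math
from collections import deque
from itertools import product

WIDTH, HEIGHT = 5, 3
WALLS = {(3, 0), (3, 1)}   # a wall between the start pocket and the goal
GOAL = (4, 0)
MOVES = {"R": (1, 0), "L": (-1, 0), "U": (0, 1), "D": (0, -1)}


def step(s, a):
    """Toy deterministic dynamics standing in for the latent predictor."""
    nx, ny = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    if 0 <= nx < WIDTH and 0 <= ny < HEIGHT and (nx, ny) not in WALLS:
        return (nx, ny)
    return s  # blocked moves leave the state unchanged


def cost_to_go():
    """Breadth-first search from the goal: exact steps-to-goal for each free cell."""
    dist = {GOAL: 0}
    queue = deque([GOAL])
    while queue:
        s = queue.popleft()
        for a in MOVES:
            n = step(s, a)
            if n not in dist:
                dist[n] = dist[s] + 1
                queue.append(n)
    return dist


def rollout(s, actions):
    """Return the states visited after each action."""
    states = []
    for a in actions:
        s = step(s, a)
        states.append(s)
    return states


def plan(start, horizon, score):
    """Exhaustively score every action sequence and return the best one."""
    return max(product(MOVES, repeat=horizon),
               key=lambda seq: score(rollout(start, seq)))


DIST = cost_to_go()


def goal_score(states):
    # Greedy baseline: negative Euclidean distance of the final state to the goal.
    return -math.dist(states[-1], GOAL)


def value_score(states):
    # Value guidance: accumulated value, with V(s) = -steps_to_goal(s).
    return sum(-DIST[s] for s in states)


start = (2, 0)
greedy = plan(start, 2, goal_score)
guided = plan(start, 2, value_score)
```

With a two-step horizon the distance-scored plan ends where it started, pressed against the wall, while the value-scored plan moves up and away from the goal to begin the detour. A learned value will be noisier than this oracle, so the gap in practice depends on how well $V_\psi$ is fit.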

The objective

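The world model is trained as usual, $\mathcal{L}_{\text{wm}}=\lVert g_\phi(z_t,a_t)-\operatorname{sg}[\bar f_\theta(o_{t+1})]\rVert^2 + \lambda\mathcal{R}(Z)$, where $\operatorname{sg}[\cdot]$ denotes stop-gradient and $\mathcal{R}(Z)$ is the anti-collapse regulariser weighted by $\lambda$. The value is fit on latent states, e.g. by a temporal-difference or regression target,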

$$\mathcal{L}_{V} = \big( V_\psi(z_t) - (r_t + \gamma\,V_\psi(z_{t+1})) \big)^2.$$
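As an illustration of the temporal-difference form of this loss, the sketch below performs one update on a linear value function $V(z)=w^\top z$. The linear parameterisation, the learning rate, and the choice to treat the bootstrap target as a constant (a semi-gradient update) are assumptions made for the example; the source does not specify them.

```python
from typing import List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def td_step(w: List[float], z_t: List[float], r_t: float, z_next: List[float],
            gamma: float = 0.9, lr: float = 0.1) -> Tuple[float, List[float]]:
    """One semi-gradient TD(0) step for a linear value V(z) = w . z.

    Returns the squared TD error before the update and the updated weights.
    """
    v_t = dot(w, z_t)
    target = r_t + gamma * dot(w, z_next)  # bootstrap target, held constant
    delta = v_t - target
    loss = delta ** 2
    # Gradient of loss w.r.t. w through V(z_t) only: 2 * delta * z_t.
    new_w = [wi - lr * 2.0 * delta * zi for wi, zi in zip(w, z_t)]
    return loss, new_w


w = [0.0, 0.0]
z_t, r_t, z_next = [1.0, 0.0], 1.0, [0.0, 1.0]
loss_before, w = td_step(w, z_t, r_t, z_next)
loss_after, _ = td_step(w, z_t, r_t, z_next)
```

On this single transition the squared TD error shrinks after one step. With function approximation and bootstrapping such updates are not guaranteed to converge in general, which is one source of the instability noted under limitations.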

Planning maximises value along the imagined trajectory,

$$\max_{a_{t:t+H-1}} \; \sum_{k=0}^{H} \gamma^{k}\, V_\psi(\hat z_{t+k}), \qquad \hat z_{t}=z_t,\quad \hat z_{t+k+1}=g_\phi(\hat z_{t+k},a_{t+k}),$$

so value supplies direction and the world model supplies imagined transitions.
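One simple way to optimise this objective is random shooting: sample action sequences, roll each out, and keep the one with the highest discounted value sum. The sketch below assumes this sampler, a uniform action distribution on $[-1,1]$, and toy stand-ins for $g_\phi$ and $V_\psi$; none of these choices is taken from the paper, and `plan_random_shooting` is a hypothetical name.

```python
import random
from typing import Callable, List, Sequence, Tuple

Latent = List[float]
Action = List[float]


def discounted_value(z0: Latent, actions: Sequence[Action],
                     predictor: Callable[[Latent, Action], Latent],
                     value: Callable[[Latent], float], gamma: float) -> float:
    """Sum of gamma^k * V(z_hat_{t+k}) for k = 0..H along the imagined rollout."""
    z, total = list(z0), value(z0)  # k = 0 term uses the current latent
    for k, a in enumerate(actions, start=1):
        z = predictor(z, a)
        total += (gamma ** k) * value(z)
    return total


def plan_random_shooting(z0: Latent, predictor, value, horizon: int = 3,
                         n_samples: int = 256, gamma: float = 0.95,
                         action_dim: int = 2, seed: int = 0
                         ) -> Tuple[List[Action], float]:
    """Return the sampled action sequence with the highest discounted value."""
    rng = random.Random(seed)
    best_actions, best_score = None, float("-inf")
    for _ in range(n_samples):
        actions = [[rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
                   for _ in range(horizon)]
        score = discounted_value(z0, actions, predictor, value, gamma)
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions, best_score


# Hypothetical stand-ins for the learned predictor and value (assumptions).
def toy_predictor(z: Latent, a: Action) -> Latent:
    return [zi + ai for zi, ai in zip(z, a)]


def toy_value(z: Latent) -> float:
    return -sum(zi * zi for zi in z)


z0 = [1.0, 1.0]
baseline = discounted_value(z0, [[0.0, 0.0]] * 3, toy_predictor, toy_value, 0.95)
best_actions, best_score = plan_random_shooting(z0, toy_predictor, toy_value)
```

In a receding-horizon loop one would typically execute only the first action of `best_actions` and replan from the next observation. Sampling could be swapped for gradient ascent through the predictor or for tree search without changing the scoring function.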

Key results & what's novel

The contribution is value-guided action planning: integrating a learned latent value with JEPA rollouts so the planner optimises predicted long-term benefit instead of greedy goal distance. This bridges energy-based JEPA planning with value-based reinforcement learning: the world model imagines, the value judges. The combination improves decision quality and effective horizon while keeping search tractable, because the dynamics are smooth in the abstract latent space and the value collapses the unrolled future into a single scalar. The result is better long-horizon behaviour without the prohibitive rollout lengths a pure-MPC approach would require.

Strengths & limitations

+ Extends effective planning horizon without longer, error-prone rollouts.
+ Avoids dead-ends and irreversible states that fool distance-to-goal scoring.
+ Combines complementary strengths of world models and value learning.
− Learning a reliable value needs a reward or success signal and can be unstable.
− Value error compounds with model error along the rollout.
− Adds a second trained component and its hyperparameters to the pipeline.

Connections & references

Builds on: V-JEPA 2, V-JEPA
