At a glance
Problem: Vision-language-action models map instructions and pixels to actions but have no internal model of consequences, so they cannot plan, anticipate, or verify outcomes before acting.
Key idea: Couple a VLA policy with a JEPA latent world model so language-grounded intent can be checked against imagined latent consequences.
Modality: Vision + language + actions
Target / masking: JEPA latent prediction against a target encoder; world model conditioned on language and action.
Builds on: JEPA action-conditioned world models (V-JEPA 2) and vision-language-action policies.
Used for: Instruction-following control with model-based foresight and outcome-driven action selection.

Motivation

Vision-language-action (VLA) models are semantically rich: they read an instruction, look at pixels, and emit actions. But they are myopic — they act reactively, with no internal simulation of what their actions will cause, so they cannot plan ahead, anticipate failure, or verify that a chosen action actually advances the instructed goal. JEPA world models are the mirror image: strong at predicting latent dynamics, weak at language grounding. VLA-JEPA proposes to fuse them, giving the instruction-following policy the predictive look-ahead it lacks while keeping the world model's planning loop grounded in natural-language goals.

How it works

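[Figure: JEPA schematic for Vision+Language. Blocks: input tokens (multi-block masking), context encoder $f_\theta$ producing $z_{\text{ctx}}$, target encoder $\bar f_\theta$ (EMA copy, stop-gradient) producing $\bar z$, predictor $g_\phi$ conditioned on action $a_t$. Losses: latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$ and a local loss (e.g. MLM).]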
Canonical JEPA schematic for Vision+Language. The input is split into a visible context and hidden targets (token-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance. The predictor is action-conditioned: $\hat z_{t+1}=g_\phi(z_t,a_t)$ — this is what turns a representation learner into a world model. A local/generative loss runs alongside latent prediction (hybrid objective).

A JEPA context encoder embeds observations into $z_t$, and a predictor advances the latent under actions, $\hat z_{t+1}=g_\phi(z_t,a_t)$. The predictor is trained with a latent prediction loss against a stop-gradient target encoder, plus an anti-collapse term. This latent world model is coupled to a VLA policy: language and vision condition both the action proposals the VLA generates and the predictive rollouts the JEPA produces, so the predictor is effectively $\hat z_{t+1}=g_\phi(z_t,a_t,\ell)$ with $\ell$ the instruction embedding.
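The training ingredients named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`predictor`, `latent_loss`, `variance_regularizer`, `ema_update`) are hypothetical, the predictor is a single tanh layer over the concatenation of $z_t$, $a_t$ and $\ell$, and a variance hinge stands in for the anti-collapse term, whose exact form the source does not specify. With no autodiff here, stop-gradient simply means the target is treated as a constant.

```python
import math
import random

def predictor(z, a, ell, W, b):
    """Hypothetical g_phi: one tanh layer over the concatenation [z; a; ell]."""
    x = z + a + ell  # list concatenation, not element-wise addition
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def latent_loss(z_pred, z_target):
    """Squared L2 distance; z_target is treated as a constant (stop-gradient)."""
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target))

def variance_regularizer(batch, gamma=1.0, eps=1e-4):
    """Assumed anti-collapse term: hinge on the per-dimension std over a batch."""
    n, d = len(batch), len(batch[0])
    penalty = 0.0
    for j in range(d):
        col = [z[j] for z in batch]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / n
        penalty += max(0.0, gamma - math.sqrt(var + eps))
    return penalty / d

def ema_update(target_params, online_params, tau=0.99):
    """Target weights follow the online weights as an exponential moving average."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

# Toy usage with made-up dimensions and values.
rng = random.Random(0)
dz, da, dl = 4, 2, 3
W = [[rng.uniform(-0.5, 0.5) for _ in range(dz + da + dl)] for _ in range(dz)]
b = [0.0] * dz
z_t, a_t, ell = [0.1, -0.2, 0.3, 0.0], [1.0, 0.0], [0.5, 0.5, -0.5]
z_hat = predictor(z_t, a_t, ell, W, b)
z_bar = [0.0, 0.1, 0.2, 0.3]  # stand-in for the target encoder's output
loss = latent_loss(z_hat, z_bar) + 0.1 * variance_regularizer([z_hat, z_bar])
```

Concatenation is only one way to inject $a_t$ and $\ell$; cross-attention or feature modulation would fit the same interface.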

The VLA proposes instruction-conditioned actions; the world model imagines their latent consequences; the agent selects or refines actions by predicted outcome — for example by scoring imagined trajectories against a language-specified goal. High-level intent and low-level dynamics thus reinforce each other in a shared latent space.
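The propose, imagine, select loop described above might look like the following one-step sketch. Everything here is illustrative: `select_action`, `toy_world_model` and `cosine_score` are hypothetical names, the hard-coded proposals stand in for samples from a VLA policy, and scoring by cosine similarity assumes the instruction embedding lives in the same space as the predicted latents, which the source does not state.

```python
import math

def cosine_score(z, ell):
    """Assumed score: cosine similarity between a latent and the instruction embedding."""
    dot = sum(x * y for x, y in zip(z, ell))
    nz = math.sqrt(sum(x * x for x in z))
    nl = math.sqrt(sum(y * y for y in ell))
    return dot / (nz * nl + 1e-8)

def select_action(z_t, proposals, ell, world_model, score):
    """Return the proposal whose imagined next latent scores highest."""
    best_action, best_score = None, -float("inf")
    for a in proposals:
        z_next = world_model(z_t, a, ell)  # imagine the latent consequence
        s = score(z_next, ell)             # check it against the instruction
        if s > best_score:
            best_action, best_score = a, s
    return best_action, best_score

def toy_world_model(z, a, ell):
    """Toy stand-in dynamics: the action shifts the latent; ell is ignored."""
    return [zi + ai for zi, ai in zip(z, a)]

z_t = [0.0, 0.0]
proposals = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # pretend VLA samples
ell = [0.0, 1.0]                                    # pretend instruction embedding
action, score_value = select_action(z_t, proposals, ell, toy_world_model, cosine_score)
```

In this toy setup the second proposal is chosen, because it moves the latent directly toward the instruction embedding.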

The objective

The world model is trained with the language- and action-conditioned latent loss

$$\mathcal{L}_{\text{wm}} = \big\lVert g_\phi(z_t, a_t, \ell) - \operatorname{sg}[\bar f_\theta(o_{t+1})] \big\rVert^2 + \lambda\,\mathcal{R}(Z),$$

where $\mathcal{R}(Z)$ is the anti-collapse regularizer weighted by $\lambda$, alongside the VLA's instruction-conditioned action objective. At inference the coupled system selects actions by imagined outcome,

$$\max_{a_{t:t+H-1}}\; \text{score}\big(\hat z_{t+H},\, \ell\big), \qquad \hat z_t = z_t,\quad \hat z_{t+k+1}=g_\phi(\hat z_{t+k},a_{t+k},\ell),\quad k=0,\dots,H-1,$$

so the semantically rich policy gains model-based foresight while the world model inherits language-specified goals.
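One simple way to approximate the maximization over action sequences is random shooting: sample candidate sequences, roll each out through the predictor, and keep the best-scoring one. The source does not say which optimizer is used, so random shooting is a swapped-in choice here, and `rollout`, `plan_random_shooting`, `toy_g` and `toy_score` are hypothetical names with toy dynamics and a toy score.

```python
import random

def rollout(z_t, actions, ell, g):
    """Apply z_{k+1} = g(z_k, a_k, ell) along the sequence; return the final latent."""
    z = z_t
    for a in actions:
        z = g(z, a, ell)
    return z

def plan_random_shooting(z_t, ell, g, score, horizon, n_candidates, action_dim, seed=0):
    """Sample action sequences, imagine each in latent space, keep the best one."""
    rng = random.Random(seed)
    best_seq, best_val = None, -float("inf")
    for _ in range(n_candidates):
        seq = [[rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
               for _ in range(horizon)]
        val = score(rollout(z_t, seq, ell, g), ell)
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq, best_val

def toy_g(z, a, ell):
    """Toy stand-in for g_phi: additive dynamics that ignore ell."""
    return [zi + 0.5 * ai for zi, ai in zip(z, a)]

def toy_score(z, ell):
    """Assumed score: negative squared distance to the instruction embedding."""
    return -sum((zi - li) ** 2 for zi, li in zip(z, ell))

z0 = [0.0, 0.0]
ell = [1.0, -1.0]
seq, val = plan_random_shooting(z0, ell, toy_g, toy_score,
                                horizon=3, n_candidates=200, action_dim=2)
baseline = toy_score(z0, ell)  # score of staying put, for comparison
```

In a receding-horizon setup only the first action of `seq` would be executed before replanning from the next observation. The VLA's own proposals could also replace the uniform sampler, which is closer to the coupling the text describes.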

Key results & what's novel

The contribution is a unified architecture in which high-level, language-grounded intent (the VLA) and low-level predictive dynamics (the JEPA world model) are coupled rather than separate. This adds the model-based look-ahead that VLAs lack, letting the agent imagine and compare consequences of instruction-conditioned actions before committing, while grounding the world model's planning loop in natural-language goals instead of opaque goal embeddings. The result is instruction following combined with predictive verification in a shared self-supervised latent space.

Strengths & limitations

  • + Gives instruction-following policies model-based foresight and outcome verification.
  • + Goals can be specified in language rather than as latent embeddings.
  • + Reuses a self-supervised JEPA world model as the predictive substrate.
  • - Two large coupled models raise training and inference cost.
  • - Planning quality is bounded by the fidelity of language-conditioned latent transitions.
  • - A deterministic predictor cannot represent ambiguity in how an instruction should unfold.

Connections & references