Motivation
Vision-language-action (VLA) models are semantically rich: they read an instruction, look at pixels, and emit actions. But they are myopic — they act reactively, with no internal simulation of what their actions will cause, so they cannot plan ahead, anticipate failure, or verify that a chosen action actually advances the instructed goal. JEPA world models are the mirror image: strong at predicting latent dynamics, weak at language grounding. VLA-JEPA proposes to fuse them, giving the instruction-following policy the predictive look-ahead it lacks while keeping the world model's planning loop grounded in natural-language goals.
How it works
A JEPA context encoder embeds observations $o_t$ into latents $z_t$, and a predictor advances the latent under actions, $\hat z_{t+1}=g_\phi(z_t,a_t)$. The predictor is trained with a latent prediction loss against the stop-gradient output of a target encoder, together with an anti-collapse regularizer. This latent world model is coupled to a VLA policy: language and vision condition both the action proposals the VLA generates and the predictive rollouts the JEPA produces, so the predictor is effectively $\hat z_{t+1}=g_\phi(z_t,a_t,\ell)$ with $\ell$ the instruction embedding.
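The training signal described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the method's actual implementation: the linear `predictor`, the weight matrix `W`, and the choice of a VICReg-style variance hinge as the anti-collapse term are all hypothetical stand-ins, since the text does not specify the predictor architecture or the regularizer. NumPy has no autodiff, so the stop-gradient is represented only by treating the target latent as a fixed input.

```python
import numpy as np

def predictor(z, a, ell, W):
    """Hypothetical linear g_phi: concatenate (z, a, ell) and project to latent space."""
    return np.concatenate([z, a, ell], axis=-1) @ W

def variance_regularizer(Z, eps=1e-4, target_std=1.0):
    """Assumed anti-collapse term: hinge on the per-dimension batch standard deviation."""
    std = np.sqrt(Z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))

def world_model_loss(z_t, a_t, ell, z_next_target, W, lam=1.0):
    """Squared latent prediction error plus lam times the anti-collapse term.

    z_next_target plays the role of sg[target_encoder(o_{t+1})]: it is a
    constant here, which is what stop-gradient enforces under autodiff.
    """
    z_pred = predictor(z_t, a_t, ell, W)
    pred_err = float(np.mean(np.sum((z_pred - z_next_target) ** 2, axis=-1)))
    return pred_err + lam * variance_regularizer(z_pred)

# Toy usage with random data (batch of 8, latent dim 4, action dim 2, language dim 3).
rng = np.random.default_rng(0)
z_t = rng.normal(size=(8, 4))
a_t = rng.normal(size=(8, 2))
ell = rng.normal(size=(8, 3))
W = rng.normal(size=(9, 4))
z_next_target = rng.normal(size=(8, 4))
loss = world_model_loss(z_t, a_t, ell, z_next_target, W)
```

The regularizer is near its maximum when every embedding in the batch is identical and falls to zero once each latent dimension has a standard deviation of at least `target_std`, which is the behaviour an anti-collapse term needs.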
The VLA proposes instruction-conditioned actions; the world model imagines their latent consequences; the agent selects or refines actions by predicted outcome — for example by scoring imagined trajectories against a language-specified goal. High-level intent and low-level dynamics thus reinforce each other in a shared latent space.
The objective
The world model is trained with the language- and action-conditioned latent loss
$$\mathcal{L}_{\text{wm}} = \big\lVert g_\phi(z_t, a_t, \ell) - \operatorname{sg}[\bar f_\theta(o_{t+1})] \big\rVert^2 + \lambda\,\mathcal{R}(Z),$$
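alongside the VLA's instruction-conditioned action objective. Here $\operatorname{sg}[\cdot]$ is the stop-gradient, $\bar f_\theta$ is the target encoder, and $\mathcal{R}(Z)$ is the anti-collapse regularizer over the embeddings $Z$, weighted by $\lambda$. At inference the coupled system selects actions by imagined outcome,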
$$\max_{a_{t:t+H-1}}\; \text{score}\big(\hat z_{t+H},\, \ell\big), \qquad \hat z_t = z_t,\quad \hat z_{t+k+1}=g_\phi(\hat z_{t+k},a_{t+k},\ell),\quad k=0,\dots,H-1,$$
so the semantically rich policy gains model-based foresight while the world model inherits language-specified goals.
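One simple way to approximate this maximization is to roll out a finite set of candidate action sequences and keep the best-scoring one. The sketch below assumes that approach; the text does not say which optimizer is used. The names `rollout`, `select_plan`, and `cosine_score` are hypothetical, `step` stands in for $g_\phi$, and the toy usage assumes the instruction embedding lives in the same space as the latents so that cosine similarity can serve as the score.

```python
import numpy as np

def rollout(z_t, actions, ell, step):
    """Apply z_{k+1} = step(z_k, a_k, ell) over the action sequence; return the final latent."""
    z = z_t
    for a in actions:
        z = step(z, a, ell)
    return z

def cosine_score(z, goal):
    """Assumed score: cosine similarity between the imagined latent and a goal vector."""
    return float(z @ goal / (np.linalg.norm(z) * np.linalg.norm(goal) + 1e-8))

def select_plan(z_t, candidates, ell, step, score):
    """Return the candidate sequence whose imagined final latent scores highest, with its score."""
    scores = [score(rollout(z_t, seq, ell, step), ell) for seq in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy usage: additive dynamics, horizon H = 2, three candidate sequences.
def toy_step(z, a, ell):
    return z + a  # placeholder for the learned predictor g_phi

z_t = np.zeros(2)
ell = np.array([1.0, 0.0])  # treated as a goal direction in latent space (assumption)
candidates = np.array([
    [[1.0, 0.0], [1.0, 0.0]],    # moves toward the goal direction
    [[0.0, 1.0], [0.0, 1.0]],    # moves orthogonally to it
    [[-1.0, 0.0], [-1.0, 0.0]],  # moves away from it
])
plan, plan_score = select_plan(z_t, candidates, ell, toy_step, cosine_score)
```

In the coupled system the candidates would come from the VLA's instruction-conditioned proposals rather than a fixed array, and the first action of the selected sequence would typically be executed before replanning.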
Key results & what's novel
The contribution is a unified architecture in which high-level, language-grounded intent (the VLA) and low-level predictive dynamics (the JEPA world model) are coupled rather than separate. This adds the model-based look-ahead that VLAs lack, letting the agent imagine and compare the consequences of instruction-conditioned actions before committing, while grounding the world model's planning loop in natural-language goals instead of opaque goal embeddings. The result is instruction following combined with predictive verification in a shared self-supervised latent space.
Strengths & limitations
- + Gives instruction-following policies model-based foresight and outcome verification.
- + Goals can be specified in language rather than as latent embeddings.
- + Reuses a self-supervised JEPA world model as the predictive substrate.
- − Two large coupled models raise training and inference cost.
- − Planning quality is bounded by the fidelity of language-conditioned latent transitions.
- − A deterministic predictor cannot represent ambiguity in how an instruction should unfold.
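The last limitation can be made concrete with a toy calculation. Suppose the same latent, action, and instruction can lead to two equally likely next latents (the two vectors below are invented for illustration). A deterministic predictor trained with squared error is pushed toward their mean, which has lower expected loss than committing to either outcome yet matches neither.

```python
import numpy as np

# Two equally plausible next latents for the same (z_t, a_t, ell); illustrative values only.
outcomes = np.array([[1.0, 0.0], [-1.0, 0.0]])

def expected_sq_error(pred):
    """Expected squared error of a single point prediction over the two outcomes."""
    return float(np.mean(np.sum((outcomes - pred) ** 2, axis=1)))

mean_pred = outcomes.mean(axis=0)              # minimizer of expected squared error
loss_mean = expected_sq_error(mean_pred)       # 1.0
loss_commit = expected_sq_error(outcomes[0])   # 2.0
```

Here the loss-minimizing prediction is the origin, a latent that corresponds to neither way the instruction could unfold.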