ThinkJEPA — World Modeling

At a glance

Problem: Dense JEPA world models excel at local latent prediction but lack abstract semantic reasoning for long horizons; vision-language reasoners are semantically rich but lack a grounded model of dynamics.

Key idea: Couple a dense JEPA latent world model with a higher-level vision-language thinker that supplies long-horizon semantic guidance.

Modality: Vision + language (+ actions)

Target / masking: Dense JEPA latent prediction against a target encoder; the thinker steers rollouts.

Builds on: JEPA action-conditioned world models and vision-language reasoning models.

Used for: Semantically directed, long-horizon behavior with grounded latent dynamics.

Motivation

Two kinds of model have complementary blind spots. A dense JEPA world model is excellent at fine-grained, local latent prediction — it knows what happens next — but it has no abstract, semantic sense of why or toward what longer-term end. A vision-language reasoning model is the reverse: semantically rich and capable of multi-step reasoning, but with no grounded predictive model of dynamics, so its plans float free of what is physically achievable. ThinkJEPA combines them, pairing a fast predictive simulator with a slow semantic deliberator so that behavior is both grounded and far-sighted.

How it works

Figure: canonical JEPA schematic for vision + language. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimises the latent distance. The predictor is action-conditioned, $\hat z_{t+1}=g_\phi(z_t,a_t)$, which is what turns a representation learner into a world model. A local/generative loss runs alongside latent prediction (hybrid objective).

The dense JEPA component is the low-level world model: a context encoder, an action-conditioned predictor $\hat z_{t+1}=g_\phi(z_t,a_t)$, and a latent prediction loss with anti-collapse against a target encoder, capturing detailed latent dynamics. Above it sits a vision-language reasoning model, the "thinker," which interprets goals, proposes subgoals or intentions, and steers the latent rollouts the JEPA produces.
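The target encoder is described above as an EMA copy of the context encoder. A minimal sketch of such an update follows, assuming parameters are stored in a dictionary of NumPy arrays; the function name `ema_update` and the decay value `tau` are illustrative choices, not details taken from the source.

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter a fraction (1 - tau) toward its online counterpart.

    Assumption: both dictionaries share the same keys and array shapes.
    No gradient flows through the target; it is only updated by this rule.
    """
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

# Toy parameters (hypothetical): one weight matrix per encoder.
online = {"w": np.ones((2, 2))}
target = {"w": np.zeros((2, 2))}

# One update with tau = 0.9 moves the target 10% of the way to the online weights.
target = ema_update(target, online, tau=0.9)
```

A decay close to 1 keeps the targets slowly moving, which is the usual reason for pairing an EMA target with a stop-gradient.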

The two levels operate at different timescales: the thinker reasons over semantic abstractions and sets targets, while the JEPA grounds those abstractions in predictable latent state and simulates the consequences of actions toward them. This couples slow, language-mediated deliberation with fast predictive simulation, echoing dual-process accounts of cognition.
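The two-timescale coupling can be sketched as a control loop in which the slow module is queried only every few steps while the fast module acts at every step. This is an illustration under stated assumptions: `thinker`, `planner`, and `dynamics` are hypothetical callables, and the fixed replanning interval `think_every` is a design choice made for the example, not something the source specifies.

```python
import numpy as np

def run_two_timescale(thinker, planner, dynamics, z0, n_steps=12, think_every=4):
    """Run a slow thinker and a fast planner over a shared latent state."""
    z, subgoal, thinker_calls = z0, None, 0
    for t in range(n_steps):
        if t % think_every == 0:
            # Slow path: the thinker proposes a new subgoal latent.
            subgoal = thinker(z)
            thinker_calls += 1
        # Fast path: the planner picks an action toward the current subgoal.
        action = planner(z, subgoal)
        z = dynamics(z, action)
    return z, thinker_calls

# Toy stand-ins (hypothetical): a fixed goal, a clipped step, additive dynamics.
goal = np.array([3.0, -2.0])
toy_thinker = lambda z: goal
toy_planner = lambda z, z_star: np.clip(z_star - z, -0.5, 0.5)
toy_dynamics = lambda z, a: z + a

z_final, n_calls = run_two_timescale(toy_thinker, toy_planner, toy_dynamics, np.zeros(2))
```

In this toy setting the thinker is called three times over twelve steps, while the planner acts on every step.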

The objective

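The dense world model is trained with the action-conditioned latent loss below, where $z_t = f_\theta(o_t)$ is the context-encoder latent, $\operatorname{sg}[\cdot]$ denotes stop-gradient, and $\mathcal{R}(Z)$ is an anti-collapse regularizer weighted by $\lambda$: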

$$\mathcal{L}_{\text{wm}} = \big\lVert g_\phi(z_t, a_t) - \operatorname{sg}[\bar f_\theta(o_{t+1})] \big\rVert^2 + \lambda\,\mathcal{R}(Z).$$
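As a rough numerical illustration of this loss, the sketch below evaluates it on a batch of latents. The source does not specify $\mathcal{R}(Z)$, so a variance hinge in the style of VICReg is assumed here purely as an example of an anti-collapse term; the function names and default values are likewise illustrative. NumPy has no autograd, so the stop-gradient is represented simply by treating the target latents as constants.

```python
import numpy as np

def variance_regularizer(Z, gamma=1.0, eps=1e-4):
    """Assumed anti-collapse term: penalise per-dimension batch std below gamma."""
    std = np.sqrt(Z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, gamma - std))

def world_model_loss(z_pred, z_target, Z, lam=0.1):
    """Squared latent prediction error plus lam * R(Z).

    z_pred   : predictor outputs g_phi(z_t, a_t), shape (batch, dim)
    z_target : target-encoder latents of o_{t+1}, treated as constants (sg)
    Z        : batch of embeddings the regularizer is applied to (assumption)
    """
    pred_err = np.mean(np.sum((z_pred - z_target) ** 2, axis=1))
    return pred_err + lam * variance_regularizer(Z)

rng = np.random.default_rng(0)
z_target = rng.normal(size=(8, 4))

# A perfect prediction with the regularizer switched off gives zero loss.
perfect_loss = world_model_loss(z_target, z_target, Z=z_target, lam=0.0)

# A fully collapsed batch (all embeddings identical) is penalised by R(Z).
collapsed_penalty = variance_regularizer(np.zeros((8, 4)))
```

In an autodiff framework the target would be wrapped in that framework's stop-gradient operation rather than relying on NumPy's lack of gradients.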

The thinker supplies a semantic subgoal latent $z^{*}$ (or intention) for a longer horizon, and low-level control solves toward it,

$$\min_{a_{t:t+H-1}} \; \big\lVert \hat z_{t+H} - z^{*} \big\rVert, \qquad \hat z_t = z_t, \quad \hat z_{t+k+1}=g_\phi(\hat z_{t+k},a_{t+k}),$$

so language-mediated reasoning sets targets that the fast latent simulator pursues.
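One simple way to approximate this minimisation is random shooting: sample candidate action sequences, roll each through the predictor, and keep the one whose final latent lands closest to $z^{*}$. The source does not say which optimizer is used, so this is only a sketch of one possible choice; `toy_g` is a hypothetical stand-in for $g_\phi$, and the action bounds and candidate count are arbitrary.

```python
import numpy as np

def rollout(g, z0, actions):
    """Apply the predictor g step by step and return the final predicted latent."""
    z = z0
    for a in actions:
        z = g(z, a)
    return z

def plan_random_shooting(g, z0, z_star, horizon, action_dim,
                         n_candidates=256, seed=0):
    """Return the sampled action sequence whose rollout ends nearest to z_star."""
    rng = np.random.default_rng(seed)
    # Assumption: actions are bounded in [-1, 1] per dimension.
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    costs = np.array([
        np.linalg.norm(rollout(g, z0, seq) - z_star) for seq in candidates
    ])
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

# Toy latent dynamics (hypothetical stand-in for the learned predictor).
toy_g = lambda z, a: z + 0.5 * a

z0 = np.zeros(2)
z_star = np.array([1.0, -1.0])  # subgoal latent, as the thinker would supply it
best_actions, best_cost = plan_random_shooting(
    toy_g, z0, z_star, horizon=5, action_dim=2
)
```

Gradient-based or cross-entropy-method planners are common alternatives when the predictor is differentiable or the action space is large.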

Key results & what's novel

ThinkJEPA exemplifies a two-level architecture where language-mediated reasoning guides dense latent world modeling, extending JEPA planning toward semantically directed, long-horizon behavior without sacrificing grounded dynamics. Its novelty is the explicit coupling of a high-level thinking module over a fast predicting module: the thinker decomposes distant, semantic goals into subgoals the JEPA can actually reach, while the JEPA keeps the thinker's abstractions tethered to predictable state. This separates the problem of what to aim for from how to get there.

Strengths & limitations

+ Combines semantic, long-horizon reasoning with grounded latent dynamics.
+ Subgoal decomposition extends planning beyond a flat JEPA's reach.
+ Modular: thinker and world model can be improved independently.
− A large vision-language reasoner adds significant inference cost.
− Subgoals proposed by the thinker may be ungrounded or unreachable by the controller.
− Coordinating the two timescales and interfaces is an additional design burden.

Connections & references

Builds on: V-JEPA 2, V-JEPA
