Motivation
Two kinds of model have complementary blind spots. A dense JEPA world model is excellent at fine-grained, local latent prediction: it knows what happens next, but it has no abstract, semantic sense of why or toward what longer-term end. A vision-language reasoning model is the reverse: semantically rich and capable of multi-step reasoning, but with no grounded predictive model of dynamics, so its plans float free of what is physically achievable. ThinkJEPA combines them, pairing a fast predictive simulator with a slow semantic deliberator so that behavior is both grounded and far-sighted.
How it works
The dense JEPA component is the low-level world model, capturing detailed latent dynamics. It consists of a context encoder, an action-conditioned predictor $\hat z_{t+1}=g_\phi(z_t,a_t)$, and a latent prediction loss that is computed against a target encoder and paired with an anti-collapse regularizer. Above it sits a vision-language reasoning model, the "thinker," which interprets goals, proposes subgoals or intentions, and steers the latent rollouts the JEPA produces.
The two levels operate at different timescales: the thinker reasons over semantic abstractions and sets targets, while the JEPA grounds those abstractions in predictable latent state and simulates the consequences of actions toward them. This couples slow, language-mediated deliberation with fast predictive simulation, echoing dual-process accounts of cognition.
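The two-timescale interaction described above can be sketched as a nested loop. This is a minimal illustration under stated assumptions, not the system's actual interface: `run_two_timescale`, `replan_every`, and the toy thinker, controller, and dynamics are all hypothetical stand-ins, with a one-dimensional latent in place of a learned one.

```python
from typing import Callable, List, Tuple

Latent = List[float]


def run_two_timescale(
    z0: Latent,
    thinker: Callable[[Latent], Latent],
    controller: Callable[[Latent, Latent], float],
    dynamics: Callable[[Latent, float], Latent],
    steps: int,
    replan_every: int,
) -> Tuple[List[Latent], int]:
    """Slow loop: the thinker proposes a subgoal every `replan_every` steps.
    Fast loop: the controller acts toward the current subgoal on every step."""
    z = list(z0)
    subgoal: Latent = list(z0)
    thinker_calls = 0
    trajectory = [list(z)]
    for t in range(steps):
        if t % replan_every == 0:  # slow timescale
            subgoal = thinker(z)
            thinker_calls += 1
        action = controller(z, subgoal)  # fast timescale
        z = dynamics(z, action)
        trajectory.append(list(z))
    return trajectory, thinker_calls


# Hypothetical toy components, for illustration only.
FINAL_GOAL = [4.0]


def toy_thinker(z: Latent) -> Latent:
    # Propose a subgoal at most 1.0 closer to the distant final goal.
    gap = FINAL_GOAL[0] - z[0]
    return [z[0] + max(-1.0, min(1.0, gap))]


def toy_controller(z: Latent, subgoal: Latent) -> float:
    # Bounded step toward the subgoal (stands in for latent planning).
    return max(-0.25, min(0.25, subgoal[0] - z[0]))


def toy_dynamics(z: Latent, action: float) -> Latent:
    # Stands in for the learned predictor g_phi.
    return [z[0] + action]


trajectory, thinker_calls = run_two_timescale(
    [0.0], toy_thinker, toy_controller, toy_dynamics, steps=16, replan_every=4
)
print(thinker_calls, trajectory[-1])
```

In this toy run the thinker is queried only 4 times across 16 control steps, which is the point of the split: the expensive reasoner runs rarely, while the cheap predictor runs on every step.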
The objective
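The dense world model is trained with the action-conditioned latent loss below, where $z_t$ is the context-encoder latent at time $t$, $\bar f_\theta$ is the target encoder applied to the next observation $o_{t+1}$, $\operatorname{sg}[\cdot]$ is the stop-gradient operator, and $\mathcal{R}(Z)$ is the anti-collapse regularizer on the latents $Z$, weighted by $\lambda$: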
$$\mathcal{L}_{\text{wm}} = \big\lVert g_\phi(z_t, a_t) - \operatorname{sg}[\bar f_\theta(o_{t+1})] \big\rVert^2 + \lambda\,\mathcal{R}(Z).$$
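As a rough numerical sketch of this loss, the snippet below evaluates both terms on a tiny batch. Several choices here are assumptions rather than details from the text: the regularizer is taken to be a variance hinge (one common anti-collapse choice), it is applied to the predicted latents, the squared error is averaged over the batch, and the function names are hypothetical. The stop-gradient appears only implicitly, in that the targets are plain constants.

```python
from typing import List

Vector = List[float]


def squared_error(pred: Vector, target: Vector) -> float:
    """Squared L2 distance between a predicted and a target latent."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def variance_hinge(batch: List[Vector], min_std: float = 1.0) -> float:
    """Assumed form of R(Z): penalize latent dimensions whose standard
    deviation across the batch falls below `min_std`."""
    n, dim = len(batch), len(batch[0])
    penalty = 0.0
    for d in range(dim):
        column = [z[d] for z in batch]
        mean = sum(column) / n
        std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
        penalty += max(0.0, min_std - std)
    return penalty / dim


def world_model_loss(
    preds: List[Vector], targets: List[Vector], lam: float = 0.1
) -> float:
    """Batch-averaged prediction error plus lam * R(Z).
    `targets` stand for sg[target_encoder(o_{t+1})]: constants, no gradient."""
    prediction = sum(
        squared_error(p, t) for p, t in zip(preds, targets)
    ) / len(preds)
    return prediction + lam * variance_hinge(preds)


# Predictions match the targets exactly, but latent dimension 1 is constant
# across the batch, so only the regularizer contributes.
preds = [[1.0, 0.0], [-1.0, 0.0]]
targets = [[1.0, 0.0], [-1.0, 0.0]]
loss_value = world_model_loss(preds, targets, lam=0.1)
print(loss_value)  # 0.05: zero prediction error, one collapsed dimension
```

The example shows why the regularizer is needed: a batch can be predicted perfectly and still be penalized if a latent dimension carries no variation.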
The thinker supplies a semantic subgoal (or intention) as a latent $z^{*}$ for a longer horizon $H$, and low-level control optimizes the action sequence to reach it,
$$\min_{a_{t:t+H-1}} \; \big\lVert \hat z_{t+H} - z^{*} \big\rVert, \qquad \hat z_{t}=z_t,\quad \hat z_{t+k+1}=g_\phi(\hat z_{t+k},a_{t+k}),\quad k=0,\dots,H-1,$$
so language-mediated reasoning sets targets that the fast latent simulator pursues.
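One simple way to approximate this minimization is random shooting: sample candidate action sequences, roll each through the predictor, and keep the one whose final latent lands closest to $z^{*}$. The text does not say which optimizer is used, so this is only an illustrative choice. The names `plan_random_shooting` and `toy_g`, the action bounds, and the additive toy dynamics are all assumptions.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]
Predictor = Callable[[Vector, Vector], Vector]


def rollout(z_t: Vector, actions: List[Vector], g: Predictor) -> Vector:
    """Apply the predictor once per action and return the final latent."""
    z = list(z_t)
    for a in actions:
        z = g(z, a)
    return z


def distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two latents."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


def plan_random_shooting(
    z_t: Vector,
    z_star: Vector,
    g: Predictor,
    horizon: int,
    action_dim: int,
    num_candidates: int = 256,
    seed: int = 0,
) -> Tuple[List[Vector], float]:
    """Sample action sequences uniformly in [-1, 1] and return the one whose
    rollout ends closest to the subgoal z_star, along with that distance."""
    rng = random.Random(seed)
    best_actions: List[Vector] = []
    best_cost = float("inf")
    for _ in range(num_candidates):
        actions = [
            [rng.uniform(-1.0, 1.0) for _ in range(action_dim)]
            for _ in range(horizon)
        ]
        cost = distance(rollout(z_t, actions, g), z_star)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


def toy_g(z: Vector, a: Vector) -> Vector:
    # Hypothetical additive dynamics standing in for the learned g_phi.
    return [zi + 0.5 * ai for zi, ai in zip(z, a)]


z_start = [0.0, 0.0]
z_goal = [1.0, -0.5]  # plays the role of the thinker's subgoal z*
best_actions, best_cost = plan_random_shooting(
    z_start, z_goal, toy_g, horizon=4, action_dim=2
)
print(len(best_actions), best_cost)
```

In practice only the first action of the best sequence would typically be executed before replanning, and a more sample-efficient optimizer could replace the uniform sampling without changing the objective.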
Key results & what's novel
ThinkJEPA exemplifies a two-level architecture in which language-mediated reasoning guides dense latent world modeling, extending JEPA planning toward semantically directed, long-horizon behavior without sacrificing grounded dynamics. Its novelty is the explicit coupling of a slow, high-level thinking module with a fast predicting module: the thinker decomposes distant, semantic goals into subgoals the JEPA can actually reach, while the JEPA keeps the thinker's abstractions tethered to predictable state. This separates the problem of what to aim for from the problem of how to get there.
Strengths & limitations
- + Combines semantic, long-horizon reasoning with grounded latent dynamics.
- + Subgoal decomposition extends planning beyond a flat JEPA's reach.
- + Modular: thinker and world model can be improved independently.
- − A large vision-language reasoner adds significant inference cost.
- − Subgoals proposed by the thinker may be ungrounded or unreachable by the controller.
- − Coordinating the two timescales and interfaces is an additional design burden.