At a glance
Problem: Flat latent world models plan at a single timestep granularity, so long-horizon tasks need long, expensive rollouts that accumulate prediction error.
Key idea: Add dynamics at multiple temporal scales and plan top-down: coarse subgoals over long intervals, fine action segments to reach them.
Modality: Vision (+ actions)
Target / masking: JEPA latent prediction against a target encoder, at multiple temporal scales.
Builds on: JEPA action-conditioned world models and hierarchical / options-based planning.
Used for: Long-horizon zero-shot control at reduced planning-time compute.

Motivation

A flat latent world model advances one timestep at a time, so reaching a distant goal means unrolling the predictor over a long horizon. Two problems compound: per-step prediction errors accumulate into useless rollouts, and the search over action sequences grows prohibitively as the horizon lengthens. Long-horizon control therefore demands temporal abstraction — the ability to reason in coarse steps that each summarise many fine ones. This work introduces a hierarchy of latent predictors so that planning effort is amortised across scales rather than spent on one long, error-prone rollout.

How it works

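[Figure: JEPA schematic. Video input (tubelets, multi-block) feeds a context encoder $f_\theta$ and a target encoder $\bar f_\theta$ (EMA copy); the predictor $g_\phi$, conditioned on action $a_t$, maps $z_{\text{ctx}}$ to $\hat z$; latent loss $\lVert \hat z - \operatorname{sg}(\bar z)\rVert^2$; auxiliary local loss (e.g. MLM).]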
Canonical JEPA schematic for Video. The input is split into a visible context and hidden targets (tubelet-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance. The predictor is action-conditioned: $\hat z_{t+1}=g_\phi(z_t,a_t)$ — this is what turns a representation learner into a world model. A local/generative loss runs alongside latent prediction (hybrid objective).

On a JEPA-style latent world model — context encoder, action-conditioned predictor $\hat z_{t+1}=g_\phi(z_t,a_t)$, latent prediction loss with anti-collapse — the method adds dynamics at multiple temporal scales. A coarse predictor advances abstract subgoal latents over long intervals, $\hat z_{t+K}=g_\phi^{\text{hi}}(z_t,\omega_t)$, while a fine predictor fills in the detailed transitions between them.
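The relationship between the two predictors can be illustrated with a toy linear system. This is a minimal sketch, not the method's architecture: the matrices `W_z` and `W_a`, the sizes `D`, `A`, `K`, and the choice to treat $\omega_t$ as a single action held for $K$ steps are all assumptions made for illustration. In this toy case the coarse predictor has a closed form, so one coarse call reproduces $K$ fine calls exactly; a learned coarse predictor would only approximate this.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, K = 4, 2, 5  # latent dim, action dim, coarse interval (all hypothetical)

# Toy linear stand-ins for the learned predictors (assumption: the real ones are networks).
W_z = 0.9 * np.eye(D)
W_a = 0.1 * rng.normal(size=(D, A))

def g_fine(z, a):
    """Fine predictor: one step, z_{t+1} = g(z_t, a_t)."""
    return W_z @ z + W_a @ a

# Closed-form K-step map of the toy system when one action is held for K steps.
W_z_K = np.linalg.matrix_power(W_z, K)
W_omega = sum(np.linalg.matrix_power(W_z, i) for i in range(K)) @ W_a

def g_coarse(z, omega):
    """Coarse predictor: one call jumps K steps, z_{t+K} = g_hi(z_t, omega_t)."""
    return W_z_K @ z + W_omega @ omega

z0 = rng.normal(size=D)
a = rng.normal(size=A)

# K fine calls versus a single coarse call from the same start.
z_fine = z0
for _ in range(K):
    z_fine = g_fine(z_fine, a)
z_coarse = g_coarse(z0, a)

print(np.allclose(z_fine, z_coarse))
```

The point of the comparison is the call count: the coarse level covers the same interval with one prediction instead of $K$ chained ones, which is where the shorter rollouts come from.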

Planning is top-down: search first selects a sequence of coarse subgoals across the long horizon, then refines short action segments to reach each subgoal in turn. Each level searches over only a short sequence, a handful of subgoals at the top and one interval's worth of actions below, so no single optimisation has to roll a predictor out over the full horizon. The coarse subgoals act as learned options that curb compounding error.
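The top-down procedure described above can be sketched with random-shooting search at both levels. Everything concrete here is an assumption for illustration: the 2-D latent, the additive toy predictors `g_fine` and `g_coarse`, the sizes `N`, `K`, `M`, and the use of random shooting rather than whatever optimiser the work actually uses. The model also doubles as the simulator, whereas a real controller would observe the next state from the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 5   # coarse steps, fine steps per coarse step (hypothetical)
M = 512       # candidates per random-shooting search (hypothetical)

def g_fine(z, a):
    """Toy fine predictor: one small step in a 2-D latent."""
    return z + a

def g_coarse(z, omega):
    """Toy coarse predictor: one long jump in the same latent."""
    return z + 0.5 * omega

def plan_subgoals(z0, z_goal):
    """Coarse search: keep the omega sequence whose rollout ends nearest the goal."""
    omegas = rng.uniform(-1.0, 1.0, size=(M, N, 2))
    traj = np.empty((M, N, 2))
    z = np.tile(z0, (M, 1))
    for i in range(N):
        z = g_coarse(z, omegas[:, i])   # all M candidates advance together
        traj[:, i] = z
    best = np.argmin(np.linalg.norm(traj[:, -1] - z_goal, axis=1))
    return traj[best]                   # (N, 2) subgoal latents

def plan_segment(z, subgoal):
    """Fine search: keep the K actions whose rollout ends nearest the subgoal."""
    actions = rng.uniform(-0.3, 0.3, size=(M, K, 2))
    z_end = np.tile(z, (M, 1))
    for k in range(K):
        z_end = g_fine(z_end, actions[:, k])
    best = np.argmin(np.linalg.norm(z_end - subgoal, axis=1))
    return actions[best], z_end[best]

z0, z_goal = np.zeros(2), np.array([1.2, 0.9])
subgoals = plan_subgoals(z0, z_goal)

z, segments = z0, []
for subgoal in subgoals:
    seg, z = plan_segment(z, subgoal)   # continue from the state actually reached
    segments.append(seg)
plan = np.concatenate(segments)         # (N*K, 2) full action sequence

print("start distance:", np.linalg.norm(z0 - z_goal))
print("final distance:", np.linalg.norm(z - z_goal))
```

Neither search ever rolls out more than `N` coarse steps or `K` fine steps, even though the resulting plan spans `N * K` fine steps.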

The objective

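Predictors at each scale are trained with latent prediction against target-encoder embeddings. For the fine and coarse levels, with $\operatorname{sg}[\cdot]$ denoting stop-gradient and $\mathcal{R}(Z)$ an anti-collapse regulariser weighted by $\lambda$: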

$$\mathcal{L} = \big\lVert g_\phi(z_t,a_t)-\operatorname{sg}[\bar f_\theta(o_{t+1})]\big\rVert^2 + \big\lVert g_\phi^{\text{hi}}(z_t,\omega_t)-\operatorname{sg}[\bar f_\theta(o_{t+K})]\big\rVert^2 + \lambda\,\mathcal{R}(Z).$$
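A minimal numerical sketch of the three terms may help make the objective concrete. The source does not specify the form of $\mathcal{R}(Z)$; the VICReg-style variance hinge used below is a stand-in chosen purely for illustration, and the function names and default `lam` are likewise hypothetical. NumPy has no autograd, so the stop-gradient is implicit: targets are simply treated as constants.

```python
import numpy as np

def sq_err(pred, target):
    """Squared L2 distance between latents, averaged over the batch."""
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

def variance_reg(Z, eps=1e-4):
    """Assumed stand-in for R(Z): hinge on per-dimension std, penalising collapse."""
    std = np.sqrt(Z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, 1.0 - std))

def hierarchical_loss(z_hat_fine, z_tgt_fine, z_hat_coarse, z_tgt_coarse, Z, lam=1.0):
    """Fine term + coarse term + lam * regulariser, mirroring the equation above."""
    # z_tgt_* stand for sg[f_bar(o_{t+1})] and sg[f_bar(o_{t+K})]: constants here.
    return (sq_err(z_hat_fine, z_tgt_fine)
            + sq_err(z_hat_coarse, z_tgt_coarse)
            + lam * variance_reg(Z))

tgt = np.array([[0.0, 1.0], [1.0, 0.0]])                # toy target embeddings
Z_spread = 2.0 * np.array([[1.0, -1.0], [-1.0, 1.0]])   # well-spread batch of latents
Z_collapsed = np.zeros((2, 2))                          # every latent identical

loss_ok = hierarchical_loss(tgt, tgt, tgt, tgt, Z_spread)
loss_collapsed = hierarchical_loss(tgt, tgt, tgt, tgt, Z_collapsed)
loss_fine_off = hierarchical_loss(tgt + np.array([1.0, 0.0]), tgt, tgt, tgt, Z_spread)

print(loss_ok, loss_collapsed, loss_fine_off)
```

With perfect predictions the loss is zero only when the latents are spread out; a collapsed batch is penalised even though both prediction terms vanish, which is the role the regulariser plays in the objective.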

Hierarchical planning then solves the long-horizon problem by composing levels: choose coarse subgoals $z^{*}_{1:N}$ over the horizon, then for each segment solve the short control problem $\min_{a_{t:t+K-1}}\lVert \hat z_{t+K}-z^{*}_{i}\rVert$, where $\hat z_{t+K}$ is obtained by rolling the fine predictor $g_\phi$ forward $K$ steps under the candidate actions. Every individual optimisation therefore stays cheap.
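A back-of-the-envelope count shows how much the decomposition can save. It is an illustration under strong assumptions, not a figure from the work: exhaustive search, $b$ candidate actions per fine step, $b_{\text{hi}}$ candidate options per coarse step, and a horizon of $H = NK$ fine steps.

$$\underbrace{b^{NK}}_{\text{flat search}} \quad\text{versus}\quad \underbrace{b_{\text{hi}}^{\,N}}_{\text{coarse search}} + \underbrace{N\,b^{K}}_{\text{fine segments}}.$$

For example, with $b=b_{\text{hi}}=4$, $N=4$ and $K=5$, the flat search enumerates $4^{20}\approx 1.1\times 10^{12}$ sequences, whereas the hierarchy enumerates $4^{4}+4\cdot 4^{5}=256+4096=4352$. Sampling-based planners do not enumerate exhaustively, but the same exponential-in-horizon gap drives their sample requirements.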

Key results & what's novel

The contribution is hierarchical, multi-timescale planning in latent world models, improving long-horizon zero-shot control while reducing planning-time compute. Amortising the horizon across a hierarchy of latent predictors, rather than a single long flat rollout, lets coarse latents serve as learned options that keep per-level search shallow and limit error accumulation. The work shows temporal abstraction is a key lever for scaling JEPA-based planning beyond the short horizons where flat models succeed.

Strengths & limitations

  • + Reaches longer horizons than flat models while spending less planning compute.
  • + Coarse subgoal latents act as options and curb compounding rollout error.
  • + Retains zero-shot, goal-directed planning in latent space.
  • − Requires choosing the temporal scales and training predictors at each.
  • − Errors in the coarse level can mislead the entire top-down search.
  • − Subgoal latents must be reachable by the fine controller, which is not guaranteed.

Connections & references