Motivation
A flat latent world model advances one timestep at a time, so reaching a distant goal means unrolling the predictor over a long horizon. Two problems compound: per-step prediction errors accumulate into useless rollouts, and the search over action sequences grows prohibitively as the horizon lengthens. Long-horizon control therefore demands temporal abstraction — the ability to reason in coarse steps that each summarise many fine ones. This work introduces a hierarchy of latent predictors so that planning effort is amortised across scales rather than spent on one long, error-prone rollout.
How it works
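The method builds on a JEPA-style latent world model, comprising a context encoder, an action-conditioned predictor $\hat z_{t+1}=g_\phi(z_t,a_t)$, and a latent prediction loss with an anti-collapse regulariser, and adds dynamics at multiple temporal scales. A coarse predictor advances abstract subgoal latents over long intervals, $\hat z_{t+K}=g_\phi^{\text{hi}}(z_t,\omega_t)$, where $K$ is the number of fine steps spanned by one coarse step and $\omega_t$ is the coarse-level conditioning input (the high-level counterpart of the action $a_t$), while the fine predictor $g_\phi$ fills in the detailed transitions between them.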
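As a minimal sketch of the two predictors, the snippet below uses toy additive latent dynamics in place of learned networks. The function names `fine_step`, `coarse_step` and `fine_rollout`, the additive form, and the choice `K = 4` are illustrative assumptions, not the paper's architecture.

```python
from typing import List

Vec = List[float]

K = 4  # assumed number of fine steps spanned by one coarse step


def fine_step(z: Vec, a: Vec) -> Vec:
    """Toy stand-in for g_phi: one fine step moves the latent by the action."""
    return [zi + ai for zi, ai in zip(z, a)]


def coarse_step(z: Vec, omega: Vec) -> Vec:
    """Toy stand-in for g_phi^hi: one coarse step jumps K fine steps ahead."""
    return [zi + K * wi for zi, wi in zip(z, omega)]


def fine_rollout(z: Vec, actions: List[Vec]) -> Vec:
    """Unroll the fine predictor over a sequence of actions."""
    for a in actions:
        z = fine_step(z, a)
    return z


z0 = [0.0, 0.0]
omega = [0.5, -0.25]
# When the same displacement is applied at every fine step, one coarse
# step lands where K fine steps do.
z_coarse = coarse_step(z0, omega)
z_fine = fine_rollout(z0, [omega] * K)
```

In this toy setting the two scales agree exactly. With learned predictors they would agree only approximately, which is what the multi-scale training objective is meant to encourage.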
Planning is top-down: search first selects a sequence of coarse subgoals across the long horizon, then refines short action segments to reach each subgoal in turn. Because the coarse level searches over only a few subgoals and the fine level over only the actions within one segment, search depth and rollout length stay small at every level, and the coarse subgoals act as learned options that curb compounding error.
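To illustrate why the per-level search stays small, the following sketch counts candidate sequences under exhaustive enumeration. It assumes a discrete set of `b_fine` actions per fine step and `b_coarse` subgoal choices per coarse step. These names, the exhaustive-search setting, and the numbers are illustrative assumptions rather than the paper's planner.

```python
def flat_candidates(b_fine: int, horizon: int) -> int:
    """Number of action sequences a flat planner enumerates over the full horizon."""
    return b_fine ** horizon


def hierarchical_candidates(b_fine: int, b_coarse: int, n_subgoals: int, k: int) -> int:
    """Coarse search over n_subgoals steps, plus one k-step fine search per segment."""
    coarse = b_coarse ** n_subgoals
    fine = n_subgoals * b_fine ** k
    return coarse + fine


# Horizon of 12 fine steps, split into 3 segments of 4 steps each.
flat = flat_candidates(b_fine=4, horizon=12)
hier = hierarchical_candidates(b_fine=4, b_coarse=4, n_subgoals=3, k=4)
print(flat, hier)  # prints 16777216 832
```

The saving comes from committing to the coarse plan before refining it, so the fine searches add up across segments instead of multiplying across the whole horizon.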
The objective
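Predictors at each scale are trained with latent prediction against target-encoder embeddings, e.g. for the fine and coarse levels:
$$\mathcal{L} = \big\lVert g_\phi(z_t,a_t)-\operatorname{sg}[\bar f_\theta(o_{t+1})]\big\rVert^2 + \big\lVert g_\phi^{\text{hi}}(z_t,\omega_t)-\operatorname{sg}[\bar f_\theta(o_{t+K})]\big\rVert^2 + \lambda\,\mathcal{R}(Z).$$
Here $\operatorname{sg}[\cdot]$ denotes stop-gradient, $\bar f_\theta$ is the target encoder, and $\mathcal{R}(Z)$ is the anti-collapse regulariser on the latents, weighted by $\lambda$.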
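The three terms of the objective can be sketched numerically as follows. The source does not specify the form of $\mathcal{R}(Z)$, so a VICReg-style variance hinge is swapped in here purely as a placeholder. The names `sq_dist`, `variance_hinge` and `hierarchical_loss`, along with the values of `floor` and `lam`, are assumptions made for illustration.

```python
from typing import List

Vec = List[float]


def sq_dist(pred: Vec, target: Vec) -> float:
    """Squared Euclidean distance between a prediction and a fixed target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def variance_hinge(batch: List[Vec], floor: float = 1.0) -> float:
    """Assumed placeholder for R(Z): penalise latent dimensions whose batch
    standard deviation falls below `floor`."""
    n, d = len(batch), len(batch[0])
    penalty = 0.0
    for j in range(d):
        col = [z[j] for z in batch]
        mean = sum(col) / n
        std = (sum((c - mean) ** 2 for c in col) / n) ** 0.5
        penalty += max(0.0, floor - std)
    return penalty / d


def hierarchical_loss(fine_pred: Vec, fine_target: Vec,
                      coarse_pred: Vec, coarse_target: Vec,
                      batch: List[Vec], lam: float = 0.1) -> float:
    """Fine term + coarse term + lam * R(Z). Targets are plain numbers here,
    so they are constants by construction; in an autodiff framework they
    would be wrapped in a stop-gradient."""
    return (sq_dist(fine_pred, fine_target)
            + sq_dist(coarse_pred, coarse_target)
            + lam * variance_hinge(batch))


loss = hierarchical_loss(
    fine_pred=[1.0, 0.0], fine_target=[1.0, 1.0],      # fine error: 1.0
    coarse_pred=[4.0, 2.0], coarse_target=[2.0, 2.0],  # coarse error: 4.0
    batch=[[0.0, 0.0], [2.0, 2.0]],                    # std 1.0 per dim, no penalty
    lam=0.1,
)
```

A fully collapsed batch, in which every latent is identical, would incur the maximum penalty from this placeholder regulariser. That is the failure mode the anti-collapse term is there to discourage.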
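Hierarchical planning then solves the long-horizon problem by composing levels: choose coarse subgoals $z^{*}_{1:N}$ over the horizon, then for each segment $i$ solve the short control problem $\min_{a}\lVert \hat z_{t+K}-z^{*}_{i}\rVert$, where $\hat z_{t+K}$ is the latent reached by unrolling the fine predictor $g_\phi$ for $K$ steps under the candidate actions $a$. Every individual optimisation therefore stays cheap.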
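The two-level procedure can be sketched on a toy one-dimensional latent space. This is a sketch under stated assumptions rather than the paper's planner: the additive dynamics, the discrete sets `ACTIONS` and `OPTIONS`, exhaustive enumeration at both levels, and all function names are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

K = 3                        # assumed fine steps per coarse step
ACTIONS = (-1.0, 0.0, 1.0)   # assumed discrete fine actions
OPTIONS = (-3.0, 0.0, 3.0)   # assumed discrete coarse inputs omega


def fine_step(z: float, a: float) -> float:
    """Toy stand-in for the fine predictor g_phi."""
    return z + a


def coarse_step(z: float, omega: float) -> float:
    """Toy stand-in for the coarse predictor g_phi^hi."""
    return z + omega


def rollout(step: Callable[[float, float], float], z: float,
            inputs: Sequence[float]) -> float:
    """Unroll a predictor over a sequence of inputs."""
    for u in inputs:
        z = step(z, u)
    return z


def plan_subgoals(z0: float, goal: float, n: int) -> List[float]:
    """Coarse level: enumerate option sequences, keep the one ending nearest the goal."""
    best = min(product(OPTIONS, repeat=n),
               key=lambda seq: abs(rollout(coarse_step, z0, seq) - goal))
    subgoals, z = [], z0
    for omega in best:
        z = coarse_step(z, omega)
        subgoals.append(z)
    return subgoals


def plan_segment(z: float, subgoal: float) -> Tuple[float, ...]:
    """Fine level: minimise |z_hat_{t+K} - z*_i| over K-step action sequences."""
    return min(product(ACTIONS, repeat=K),
               key=lambda seq: abs(rollout(fine_step, z, seq) - subgoal))


def hierarchical_plan(z0: float, goal: float, n: int):
    """Choose subgoals first, then solve one short control problem per segment."""
    subgoals = plan_subgoals(z0, goal, n)
    z, actions = z0, []
    for subgoal in subgoals:
        segment = plan_segment(z, subgoal)
        actions.extend(segment)
        z = rollout(fine_step, z, segment)  # continue from where the fine plan lands
    return subgoals, actions, z


subgoals, actions, z_final = hierarchical_plan(z0=0.0, goal=6.0, n=3)
```

Each fine search starts from the latent the previous segment actually reached, not from the subgoal itself, so a subgoal the fine level cannot reach shows up as a residual error carried into the next segment.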
Key results & what's novel
The contribution is hierarchical, multi-timescale planning in latent world models, which improves long-horizon zero-shot control while reducing planning-time compute. The horizon is amortised across a hierarchy of latent predictors rather than a single long flat rollout, so coarse latents serve as learned options that keep per-level search shallow and limit error accumulation. The work shows that temporal abstraction is a key lever for scaling JEPA-based planning beyond the short horizons where flat models succeed.
Strengths & limitations
- + Reaches longer horizons than flat models while spending less planning compute.
- + Coarse subgoal latents act as options and curb compounding rollout error.
- + Retains zero-shot, goal-directed planning in latent space.
- − Requires choosing the temporal scales and training predictors at each.
- − Errors in the coarse level can mislead the entire top-down search.
- − Subgoal latents must be reachable by the fine controller, which is not guaranteed.