ACT-JEPA — World Modeling

Overview

Canonical JEPA schematic for Action. The input is split into a visible context and hidden targets (token-level, masked blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy of $f_\theta$, with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between predicted and target embeddings in latent space. The predictor is action-conditioned, $\hat z_{t+1}=g_\phi(z_t,a_t)$, and this conditioning is what turns a representation learner into a world model.
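The three components in the schematic can be made concrete with a small NumPy sketch. It is an illustration only, not the ACT-JEPA implementation: the linear-plus-tanh encoder, the linear predictor, the dimensions, and the EMA rate `tau` are all assumptions chosen for brevity, and NumPy has no autodiff, so the stop-gradient appears only as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_ACT, D_LAT = 8, 2, 4  # hypothetical sizes


def encode(W, x):
    """Toy encoder (linear map + tanh) standing in for f_theta."""
    return np.tanh(x @ W)


def predict(phi, z_t, a_t):
    """Action-conditioned predictor: z_hat_{t+1} = g_phi(z_t, a_t)."""
    return np.concatenate([z_t, a_t], axis=-1) @ phi


def ema_update(W_target, W_online, tau=0.99):
    """Target weights track the online weights as an exponential moving average."""
    return tau * W_target + (1.0 - tau) * W_online


W_online = rng.normal(size=(D_OBS, D_LAT)) * 0.1
W_target = W_online.copy()                      # target encoder starts as a copy
phi = rng.normal(size=(D_LAT + D_ACT, D_LAT)) * 0.1

o_t, o_next = rng.normal(size=D_OBS), rng.normal(size=D_OBS)
a_t = rng.normal(size=D_ACT)

z_t = encode(W_online, o_t)
# In an autodiff framework this target would be wrapped in a stop-gradient.
z_next_target = encode(W_target, o_next)
z_next_pred = predict(phi, z_t, a_t)
loss = float(np.mean((z_next_pred - z_next_target) ** 2))

# Stand-in for one optimiser step on the online encoder, followed by the EMA update.
W_online_new = W_online + 0.01
W_target = ema_update(W_target, W_online_new, tau=0.99)
```

Only the online encoder and the predictor would receive gradients; the target encoder changes solely through `ema_update`, which is why it lags the online weights by a factor of `1 - tau` per step.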

Problem. Imitation and behavior-cloning policies are data-hungry and brittle because they regress raw actions without a structured predictive model of how observations evolve. ACT-JEPA injects world-model structure into action representation learning.

Mechanism. Building on the JEPA template, a context encoder embeds the agent's observation (and action) history, and a predictor is trained to jointly predict latent representations of future actions and future observations rather than reconstructing pixels or raw controls. The objective is a latent prediction loss against target encodings, $\lVert g_\phi(z_{\text{ctx}}) - \text{sg}(z_{\text{tgt}})\rVert$, where $\text{sg}$ denotes stop-gradient; together with standard anti-collapse measures, this keeps the action and observation streams from degenerating to trivial constant embeddings. Masking over time supplies the self-supervised signal, so abstract action sequences and their consequences are learned together.
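A minimal sketch of the masked latent objective follows, assuming a squared L2 distance, mean pooling over the visible steps in place of a sequence model, a per-step time embedding to tell the predictor which hidden step to fill in, and separate linear heads for the action and observation streams. None of these choices come from the source; all weights and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_OBS, D_ACT, D_LAT = 6, 8, 2, 4  # hypothetical sizes

# Toy linear weights standing in for trained networks (all hypothetical).
W_ctx = rng.normal(size=(D_OBS + D_ACT, D_LAT)) * 0.1  # context encoder
W_tgt_obs = rng.normal(size=(D_OBS, D_LAT)) * 0.1      # target encoder, observations
W_tgt_act = rng.normal(size=(D_ACT, D_LAT)) * 0.1      # target encoder, actions
P_obs = rng.normal(size=(D_LAT, D_LAT)) * 0.1          # predictor head, observations
P_act = rng.normal(size=(D_LAT, D_LAT)) * 0.1          # predictor head, actions
time_emb = rng.normal(size=(T, D_LAT)) * 0.1           # marks which step is being predicted


def masked_latent_loss(obs, act, mask):
    """Joint latent prediction loss over the time steps where mask is True."""
    visible = ~mask
    # Context embedding from visible steps only (mean pooling as a stand-in).
    ctx_in = np.concatenate([obs[visible], act[visible]], axis=-1)
    z_ctx = np.tanh(ctx_in @ W_ctx).mean(axis=0)

    # Target embeddings of the hidden steps; with autodiff these would be sg(z_tgt).
    z_tgt_obs = np.tanh(obs[mask] @ W_tgt_obs)
    z_tgt_act = np.tanh(act[mask] @ W_tgt_act)

    # One query per hidden step, then one prediction per stream.
    query = z_ctx + time_emb[mask]                      # shape (n_hidden, D_LAT)
    pred_obs = query @ P_obs
    pred_act = query @ P_act

    loss_obs = float(np.mean(np.sum((pred_obs - z_tgt_obs) ** 2, axis=-1)))
    loss_act = float(np.mean(np.sum((pred_act - z_tgt_act) ** 2, axis=-1)))
    return loss_obs, loss_act


obs = rng.normal(size=(T, D_OBS))
act = rng.normal(size=(T, D_ACT))
mask = np.array([False, False, False, True, True, True])  # hide the last three steps

loss_obs, loss_act = masked_latent_loss(obs, act, mask)
total_loss = loss_obs + loss_act
```

The two terms are summed with equal weight here purely for simplicity; how the action and observation losses are balanced is a design choice this sketch does not try to settle.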

Contribution. Coupling action prediction with observation-latent prediction yields representations that capture both intent and consequence, improving policy quality and sample efficiency over pure behavior cloning while remaining a single JEPA-style objective.

Significance. For world modeling, ACT-JEPA shows the predictor can be the locus of policy learning: a model that anticipates its own future actions in latent space is implicitly a controllable world model usable for planning. For a biological or drug-discovery world model, the joint formulation is natural: treat interventions as the action stream and cellular state as the observation stream, so the model learns both which perturbation to apply and what latent phenotypic response to expect, supporting efficient intervention-policy design from limited experimental data.
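As a hedged illustration of the planning use, the sketch below ranks candidate interventions by how close their predicted latent response lands to a desired latent state. The one-step linear predictor, the Euclidean distance, and the names `rank_interventions`, `z_goal`, and `candidates` are assumptions made for the example; they are not part of ACT-JEPA.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LAT, D_ACT, N_CAND = 4, 3, 5  # hypothetical sizes

# Stand-in for a trained action-conditioned predictor g_phi (hypothetical weights).
phi = rng.normal(size=(D_LAT + D_ACT, D_LAT)) * 0.5


def predict(z, a):
    """z_hat_next = g_phi(z, a), here a single linear map."""
    return np.concatenate([z, a], axis=-1) @ phi


def rank_interventions(z_now, candidates, z_goal):
    """Sort candidate actions by predicted latent distance to the goal, closest first."""
    z_batch = np.broadcast_to(z_now, (len(candidates), z_now.shape[-1]))
    z_pred = predict(z_batch, candidates)
    dist = np.linalg.norm(z_pred - z_goal, axis=-1)
    return np.argsort(dist), dist


z_now = rng.normal(size=D_LAT)                  # current latent state (e.g. cellular state)
candidates = rng.normal(size=(N_CAND, D_ACT))   # candidate interventions

# For the demo, define the goal as the predicted outcome of candidate 2,
# so that candidate should be ranked first.
z_goal = predict(z_now, candidates[2])
order, dist = rank_interventions(z_now, candidates, z_goal)
```

In the biological reading, `candidates` would encode perturbations and `z_goal` a desired phenotype embedding; a multi-step planner would roll the predictor forward repeatedly rather than scoring a single step.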