Overview

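[Figure: JEPA schematic. Input: Action tokens with masked blocks. Components: context encoder $f_\theta$ (output $z_{\text{ctx}}$); target encoder $\bar f_\theta$ (EMA copy, output $\bar z$ under stop-gradient); predictor $g_\phi$, conditioned on action $a_t$. Latent loss: $\|\hat z - \mathrm{sg}(\bar z)\|^2$.]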
Canonical JEPA schematic for Action. The input is split into a visible context and hidden targets (token-level, masked blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance. The predictor is action-conditioned: $\hat z_{t+1}=g_\phi(z_t,a_t)$ — this is what turns a representation learner into a world model.
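The schematic can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the encoders and predictor are stand-in linear maps, the dimensions `D_IN`, `D_LAT`, `D_ACT` and the EMA rate `tau` are hypothetical, and because NumPy has no autodiff the stop-gradient is represented simply by treating the target embedding as a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LAT, D_ACT = 8, 4, 2  # hypothetical sizes, chosen only for illustration

# Linear stand-ins for the three modules (assumption: real models are deep networks).
W_ctx = rng.normal(scale=0.1, size=(D_IN, D_LAT))            # context encoder f_theta
W_tgt = W_ctx.copy()                                          # target encoder, initialised as a copy
W_pred = rng.normal(scale=0.1, size=(D_LAT + D_ACT, D_LAT))   # predictor g_phi


def predict(z_t, a_t, w_pred):
    """Action-conditioned predictor: z_hat_{t+1} = g_phi(z_t, a_t)."""
    return np.concatenate([z_t, a_t], axis=-1) @ w_pred


def latent_loss(x_t, a_t, x_next):
    """Mean squared latent distance ||z_hat - sg(z_bar)||^2 over a batch."""
    z_t = x_t @ W_ctx                  # embed the visible context
    z_bar = x_next @ W_tgt             # target embedding, treated as a constant (stop-gradient)
    z_hat = predict(z_t, a_t, W_pred)  # predicted target embedding
    return float(np.mean(np.sum((z_hat - z_bar) ** 2, axis=-1)))


def ema_update(w_target, w_online, tau=0.99):
    """Move the target weights a small step toward the online weights."""
    return tau * w_target + (1.0 - tau) * w_online


# Usage on random data: a batch of 5 transitions (x_t, a_t, x_{t+1}).
x_t = rng.normal(size=(5, D_IN))
a_t = rng.normal(size=(5, D_ACT))
x_next = rng.normal(size=(5, D_IN))
loss = latent_loss(x_t, a_t, x_next)
W_tgt = ema_update(W_tgt, W_ctx)  # would follow each gradient step on W_ctx and W_pred
```

In an autodiff framework the stop-gradient would be explicit (for example `jax.lax.stop_gradient` in JAX or `.detach()` in PyTorch), and only the context encoder and predictor would receive gradients.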

Problem. Standard JEPAs mask spatiotemporal patches, entangling distinct objects in a single latent and making counterfactual reasoning ("what if only this entity changes?") and credit assignment difficult. Causal-JEPA restructures masking around objects.

Mechanism. The context encoder produces a set of object- or entity-centric latents. The self-supervised signal comes from masking at the level of objects rather than pixels or patches, so the predictor must infer a masked entity's latent from the other entities and from actions. Training keeps the JEPA core (predicting target-encoder latents under stop-gradient, with an anti-collapse mechanism), but the factored structure means that transitions $\hat z_{t+1}=g_\phi(z_t,a_t)$ act on disentangled entity slots. This isolates how each entity, and each intervention on it, propagates, which is what yields causal, modular dynamics.
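Object-level masking can be sketched as follows. This is a toy illustration, not the method's actual architecture: the slot count, the zero-valued `mask_token`, and the mean-pooling linear predictor are all assumptions introduced here, and the slot latents are random placeholders for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SLOTS, D_LAT, D_ACT = 4, 6, 2  # hypothetical sizes


def mask_entity(slots, k, mask_token):
    """Return a copy of the slot set with entity k replaced by a mask token."""
    masked = slots.copy()
    masked[k] = mask_token
    return masked


def predict_masked_slot(masked_slots, k, action, w):
    """Infer the latent of entity k from the mean of the other slots and the action."""
    visible = np.delete(masked_slots, k, axis=0)           # drop the masked entity
    features = np.concatenate([visible.mean(axis=0), action])
    return features @ w


slots_t = rng.normal(size=(N_SLOTS, D_LAT))       # context-encoder slots at time t (placeholder)
target_next = rng.normal(size=(N_SLOTS, D_LAT))   # target-encoder slots at t+1, treated as constants
action = rng.normal(size=D_ACT)
mask_token = np.zeros(D_LAT)                      # assumption: a learned vector in practice
W = rng.normal(scale=0.1, size=(D_LAT + D_ACT, D_LAT))

k = 2                                             # index of the entity to mask
masked = mask_entity(slots_t, k, mask_token)
z_hat_k = predict_masked_slot(masked, k, action, W)
loss = float(np.sum((z_hat_k - target_next[k]) ** 2))  # loss only on the masked entity
```

The loss is computed only on the masked slot, so the predictor is rewarded for using the remaining entities and the action rather than for copying its input.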

Contribution. Object-level latent masking gives world models that support counterfactual reasoning and more efficient planning: by reasoning over a sparse set of entities, search and intervention evaluation become combinatorially cheaper and more interpretable than dense pixel-space alternatives.

Significance. For world modeling, factoring the latent into causal entities is a route to systematic generalization and intervention semantics — a step toward genuinely causal JEPAs. For a biological or drug-discovery world model the mapping is direct: objects map to pathways, regulatory programs, or molecular entities. Masking an entity and predicting its latent under a perturbation mirrors a knockout or compound intervention, so the model supports counterfactual queries ("silence this pathway, predict the phenotype") and efficient planning over which programs to target.
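A counterfactual query of the "silence this entity, predict the outcome" kind might look like the sketch below. Everything here is hypothetical: the interaction graph `A`, the linear transition `step`, and the zero-latent `knockout` are illustrative stand-ins for a learned slot-wise predictor, not part of the described model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 3  # hypothetical number of entities and latent size

# Hypothetical interaction graph: A[i, j] = 1 means entity j influences entity i.
A = np.array([[0, 1, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
W_self = rng.normal(scale=0.5, size=(D, D))  # per-entity self-dynamics
W_int = rng.normal(scale=0.5, size=(D, D))   # effect of incoming interactions


def step(z):
    """Toy slot-wise transition: self-dynamics plus messages from influencing entities."""
    return z @ W_self + (A @ z) @ W_int


def knockout(z, k):
    """Intervention: silence entity k by zeroing its latent (an assumed baseline)."""
    z_cf = z.copy()
    z_cf[k] = 0.0
    return z_cf


z = rng.normal(size=(N, D))
factual = step(z)
counterfactual = step(knockout(z, 1))
# Per-entity size of the intervention's effect on the next-step latents.
effect = np.linalg.norm(counterfactual - factual, axis=1)
```

In this toy graph entity 1 influences entities 0 and 2 but not entity 3, so the effect is confined to entities 0, 1, and 2. That locality is the property that makes interventions on a factored latent straightforward to read off.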