Overview
Problem. Standard JEPAs mask spatiotemporal patches, entangling distinct objects in a single latent and making counterfactual reasoning ("what if only this entity changes?") and credit assignment difficult. Causal-JEPA restructures masking around objects.
Mechanism. The context encoder produces a set of object-centric (entity-centric) latents. The self-supervised signal comes from masking whole objects rather than pixels or patches, so the predictor must infer a masked entity's latent from the remaining entities and from the actions. Training keeps the JEPA core: the predictor regresses target-encoder latents, with a stop-gradient on the target branch and an anti-collapse mechanism. Because the latent is factored, the transition $z_{t+1}=P(z_t,a_t)$ acts on disentangled entity slots rather than on one entangled vector. This isolates how each entity, and each intervention on it, propagates through the dynamics, yielding modular dynamics that can be queried causally.
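The object-level masking step and the latent regression loss can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the function names (`mask_objects`, `toy_predictor`, `masked_slot_loss`), the mask ratio, and especially the predictor, which is a hand-written stand-in (mean of visible slots plus the action) where a real model would use a learned network. The target latents are passed in as plain numbers, which plays the role of the stop-gradient branch; in an autodiff framework this would be an explicit detach or stop-gradient call.

```python
import random

def mask_objects(num_slots, mask_ratio, rng):
    """Choose whole object slots to hide; returns (visible, masked) index lists.
    At least one slot is masked and at least one stays visible."""
    k = max(1, min(num_slots - 1, round(mask_ratio * num_slots)))
    masked = sorted(rng.sample(range(num_slots), k))
    visible = [i for i in range(num_slots) if i not in masked]
    return visible, masked

def toy_predictor(visible_slots, action):
    """Hypothetical stand-in for the predictor: mean of the visible slot
    latents plus the action vector. Only the interface is meaningful."""
    dim = len(action)
    n = len(visible_slots)
    mean = [sum(s[d] for s in visible_slots) / n for d in range(dim)]
    return [m + a for m, a in zip(mean, action)]

def masked_slot_loss(context_slots, target_slots, action, mask_ratio, rng):
    """Mean squared error between the predicted latent and the target latent
    of each masked slot, averaged over masked slots. target_slots stand for
    target-encoder outputs and are treated as constants."""
    visible, masked = mask_objects(len(context_slots), mask_ratio, rng)
    pred = toy_predictor([context_slots[i] for i in visible], action)
    total = 0.0
    for i in masked:
        sq_err = sum((p - t) ** 2 for p, t in zip(pred, target_slots[i]))
        total += sq_err / len(pred)
    return total / len(masked), masked

# Four toy entity slots with 2-dimensional latents (made-up numbers).
rng = random.Random(0)
slots = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
loss, masked = masked_slot_loss(slots, slots, [0.0, 0.0], 0.5, rng)
print("masked slots:", masked, "loss:", loss)
```

The point of the sketch is the unit of masking: entire slots are removed from the predictor's input, so the only way to reduce the loss is to infer a hidden entity from the others and from the action.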
Contribution. Object-level latent masking gives world models that support counterfactual reasoning and more efficient planning: because the model reasons over a sparse set of entities rather than a dense pixel-space state, search and intervention evaluation range over far fewer candidates and are easier to interpret.
Significance. For world modeling, factoring the latent into causal entities is a route to systematic generalization and intervention semantics, and a step toward genuinely causal JEPAs. For a biological or drug-discovery world model the mapping is direct: objects map to pathways, regulatory programs, or molecular entities. Masking an entity and predicting its latent under a perturbation mirrors a knockout or compound intervention, so the model supports counterfactual queries ("silence this pathway, predict the phenotype") and efficient planning over which programs to target.
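A counterfactual query of the "silence this entity, predict the outcome" kind can be sketched as follows. This is a toy under stated assumptions, not the paper's model: the transition `step` is a hypothetical linear stand-in for the learned predictor $P$, the interaction matrix `W` (where `W[i][j]` is the influence of slot `j` on slot `i`) and all numbers are invented, and `intervene` and `counterfactual_effect` are illustrative names.

```python
def step(slots, action, W):
    """Hypothetical slot-wise transition: next slot i is a weighted sum of the
    current slots (weights W[i][j]) plus the action vector."""
    n, dim = len(slots), len(slots[0])
    return [
        [sum(W[i][j] * slots[j][d] for j in range(n)) + action[d] for d in range(dim)]
        for i in range(n)
    ]

def intervene(slots, k, value):
    """Return a copy of slots with slot k overwritten by value (a 'knockout')."""
    out = [list(s) for s in slots]
    out[k] = list(value)
    return out

def counterfactual_effect(slots, action, W, k, value):
    """Per-slot L1 difference between the factual one-step prediction and the
    prediction after intervening on slot k."""
    factual = step(slots, action, W)
    counter = step(intervene(slots, k, value), action, W)
    return [sum(abs(f - c) for f, c in zip(fs, cs)) for fs, cs in zip(factual, counter)]

# Three toy entities with 1-dimensional latents; slot 0 influences slot 1 only.
W = [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
slots = [[2.0], [1.0], [3.0]]
effect = counterfactual_effect(slots, [0.0], W, k=0, value=[0.0])
print(effect)  # [2.0, 1.0, 0.0]
```

In this toy, silencing slot 0 changes slot 0 itself and the slot it drives (slot 1), while the unrelated slot 2 is unaffected. That per-entity readout is the kind of intervention semantics a factored latent is meant to expose; with an entangled latent there would be no slot to overwrite.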