Overview

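[Figure: JEPA schematic. Input (Action): tokens, masked blocks. Components: context encoder $f_\theta$ producing $z_{ctx}$; target encoder $\bar f_\theta$ (EMA copy) producing $\bar z$ under stop-gradient; predictor $g_\phi$ conditioned on action $a_t$. Latent loss: $\|\hat z - \mathrm{sg}(\bar z)\|^2$.]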
Canonical JEPA schematic for Action. The input is split into a visible context and hidden targets (token-level, masked blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy of the context encoder, with gradients stopped) embeds the targets; the predictor $g_\phi$ predicts the target embeddings from the context embedding; training minimises the latent distance $\|\hat z - \mathrm{sg}(\bar z)\|^2$. The predictor is action-conditioned, $\hat z_{t+1}=g_\phi(z_t,a_t)$, and this conditioning is what turns a representation learner into a world model.
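The schematic can be made concrete with a minimal NumPy sketch of one canonical JEPA step. Everything here is an illustrative assumption rather than any paper's implementation: the linear maps `W_ctx`, `W_tgt`, and `W_pred` stand in for $f_\theta$, $\bar f_\theta$, and $g_\phi$, the dimensions and the decay `tau` are arbitrary, and since NumPy has no autodiff, the stop-gradient is only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, latent_dim = 8, 2, 4  # hypothetical sizes

# Hypothetical linear stand-ins for f_theta, f_bar_theta and g_phi.
W_ctx = rng.normal(scale=0.1, size=(obs_dim, latent_dim))
W_tgt = W_ctx.copy()  # target encoder starts as a copy of the context encoder
W_pred = rng.normal(scale=0.1, size=(latent_dim + act_dim, latent_dim))


def encode(W, x):
    """Linear encoder: maps observations to latents."""
    return x @ W


def predict(W_pred, z, a):
    """Action-conditioned predictor: z_hat_{t+1} = g_phi(z_t, a_t)."""
    return np.concatenate([z, a], axis=-1) @ W_pred


def latent_loss(z_hat, z_bar):
    """Mean squared latent distance; z_bar is treated as a constant (stop-gradient)."""
    return float(np.mean(np.sum((z_hat - z_bar) ** 2, axis=-1)))


def ema_update(W_tgt, W_ctx, tau=0.99):
    """Target weights track context weights: W_tgt <- tau * W_tgt + (1 - tau) * W_ctx."""
    return tau * W_tgt + (1.0 - tau) * W_ctx


# Random placeholder batch of (observation, action, next observation).
x_t = rng.normal(size=(16, obs_dim))
a_t = rng.normal(size=(16, act_dim))
x_next = rng.normal(size=(16, obs_dim))

z_t = encode(W_ctx, x_t)
z_hat_next = predict(W_pred, z_t, a_t)
z_bar_next = encode(W_tgt, x_next)  # in an autodiff framework, no gradient flows here
loss = latent_loss(z_hat_next, z_bar_next)

# Pretend an optimizer step moved the context encoder, then refresh the target.
W_ctx_new = W_ctx + 0.01
W_tgt = ema_update(W_tgt, W_ctx_new, tau=0.99)
```

The decay `tau` and the stop-gradient on the target branch are the two pieces that a teacher-free method would no longer need.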

Problem. End-to-end JEPA world models from pixels are notoriously unstable: jointly learning the encoder, the predictor, and the action conditioning risks representational collapse, in which the encoder maps all inputs to nearly the same embedding. The usual fix, an EMA teacher (target encoder), adds fragility and hyperparameter sensitivity of its own. LeWorldModel removes the teacher.

Mechanism. A context encoder embeds pixels and a predictor advances the latent under actions, $\hat z_{t+1}=P(z_t,a_t)$; both are learned end-to-end from pixels. The objective has two terms: a latent prediction term that drives action-conditioned dynamics, and an anti-collapse term using SIGReg (a distributional regularizer that constrains the latent's isotropy/Gaussianity) in place of a stop-gradient EMA teacher. SIGReg supplies the collapse-prevention pressure that the teacher normally provides, yielding a stable single-network training signal.
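To illustrate the shape of this two-term objective, here is a minimal NumPy sketch. It is not the paper's SIGReg implementation: `gaussianity_penalty` is a simplified, hypothetical stand-in that compares the empirical characteristic function of random 1-D projections of the latents with that of a standard Gaussian. The weight `lam`, the number of directions, the frequency grid, and the choice of which latents to regularize are all assumptions made for illustration.

```python
import numpy as np


def gaussianity_penalty(z, num_dirs=32, t_grid=None, rng=None):
    """Simplified stand-in for an isotropic-Gaussian regularizer (NOT the official SIGReg).

    Projects latents onto random unit directions and measures how far the
    empirical characteristic function of each projection is from exp(-t^2 / 2),
    the characteristic function of N(0, 1).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    t_grid = np.linspace(-3.0, 3.0, 17) if t_grid is None else t_grid
    dirs = rng.normal(size=(z.shape[1], num_dirs))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit-norm columns
    proj = z @ dirs                                       # (batch, num_dirs)
    phase = proj[:, :, None] * t_grid[None, None, :]      # (batch, num_dirs, T)
    ecf_real = np.cos(phase).mean(axis=0)                 # real part of empirical CF
    ecf_imag = np.sin(phase).mean(axis=0)                 # imaginary part of empirical CF
    target = np.exp(-0.5 * t_grid ** 2)                   # N(0, 1) CF is real-valued
    return float(np.mean((ecf_real - target) ** 2 + ecf_imag ** 2))


def world_model_loss(z_hat_next, z_next, z_all, lam=1.0):
    """Prediction term plus lam times the anti-collapse term (lam is a hypothetical weight)."""
    pred = float(np.mean(np.sum((z_hat_next - z_next) ** 2, axis=-1)))
    reg = gaussianity_penalty(z_all)
    return pred + lam * reg, pred, reg


# Compare well-spread latents with fully collapsed ones.
rng = np.random.default_rng(1)
gauss_latents = rng.normal(size=(2048, 8))
collapsed_latents = np.zeros((2048, 8))

penalty_gauss = gaussianity_penalty(gauss_latents)
penalty_collapsed = gaussianity_penalty(collapsed_latents)

# A perfect predictor on collapsed latents still pays the regularization cost.
total, pred, reg = world_model_loss(
    collapsed_latents, collapsed_latents, collapsed_latents, lam=0.5
)
```

Collapsed latents make the prediction term trivially zero, but the penalty stays large for them and small for Gaussian-like latents, so the collapsed solution is no longer a minimum of the total loss. That is the role the EMA teacher plays in the canonical setup.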

Contribution. A stable, teacher-free recipe for learning action-conditioned latent world models from raw pixels, which simplifies the JEPA-for-control pipeline while retaining plannable latent dynamics.

Significance. For world modeling, eliminating the EMA teacher reduces a major source of instability and makes end-to-end learning of controllable latent dynamics practical, lowering the barrier to model-based planning. For a biological or drug-discovery world model, teacher-free stability matters because perturbation datasets are small and noisy. A SIGReg-regularized objective can learn action-conditioned latent dynamics over cellular states, with interventions as actions, without the delicate teacher tuning that often fails in low-data scientific regimes, and it keeps the latent well-conditioned for downstream planning.