Overview
Problem. Training JEPA world models end-to-end from pixels is notoriously unstable: jointly learning the encoder, the predictor, and the action conditioning risks representational collapse, and the usual fix, an EMA teacher (target) encoder, adds fragility and hyperparameter sensitivity. LeWorldModel removes the teacher.
Mechanism. A context encoder embeds pixels into a latent $z_t$, and a predictor advances that latent under actions, $z_{t+1}=P(z_t,a_t)$; both are learned end-to-end directly from pixels. The objective has two terms: a latent prediction term that drives the action-conditioned dynamics, and an anti-collapse term, SIGReg, a distributional regularizer that pushes the latent distribution toward an isotropic Gaussian. SIGReg takes the place of the stop-gradient EMA teacher and supplies the collapse-prevention pressure that the teacher normally provides, so a single encoder is trained with a stable signal.
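As a rough illustration of the two-term objective, the sketch below pairs a squared-error latent prediction term with a Gaussianity penalty. The penalty is a simplified stand-in, not the SIGReg statistic itself: it compares the empirical characteristic function of random one-dimensional projections of the latents with that of a standard normal. The function names, the weight `lam`, the number of projections, and the frequency grid are all assumptions made for illustration.

```python
import numpy as np


def prediction_loss(z_pred, z_next):
    """Mean squared error between predicted and encoded next latents."""
    return float(np.mean(np.sum((z_pred - z_next) ** 2, axis=1)))


def gaussianity_penalty(z, num_directions=32, num_freqs=17, t_max=3.0, seed=0):
    """Illustrative stand-in for an isotropic-Gaussian regularizer.

    Projects latents z (shape [n, d]) onto random unit directions and
    measures how far each projection's empirical characteristic function
    is from exp(-t^2 / 2), the characteristic function of N(0, 1).
    """
    rng = np.random.default_rng(seed)
    _, d = z.shape
    dirs = rng.standard_normal((d, num_directions))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit-norm columns
    proj = z @ dirs                                      # [n, K]

    t = np.linspace(-t_max, t_max, num_freqs)            # frequency grid [T]
    # Empirical characteristic function per direction: mean_n exp(i * t * x_n)
    ecf = np.exp(1j * proj[:, :, None] * t).mean(axis=0)  # [K, T]
    target = np.exp(-0.5 * t ** 2)                       # CF of N(0, 1)
    window = np.exp(-0.5 * t ** 2)                       # assumed Gaussian weighting
    sq_err = np.abs(ecf - target) ** 2 * window          # [K, T]
    dt = t[1] - t[0]
    return float(np.mean(np.sum(sq_err, axis=1) * dt))   # Riemann sum, mean over K


def total_loss(z_t, a_t, z_next, predictor, lam=1.0):
    """Prediction term plus lam times the anti-collapse penalty."""
    z_pred = predictor(z_t, a_t)
    latents = np.concatenate([z_t, z_next], axis=0)
    return prediction_loss(z_pred, z_next) + lam * gaussianity_penalty(latents)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    healthy = rng.standard_normal((1000, 8))   # well-spread latents
    collapsed = np.zeros((1000, 8))            # every input maps to one point
    actions = rng.standard_normal((1000, 2))   # hypothetical action vectors

    print("penalty, healthy:  ", gaussianity_penalty(healthy))
    print("penalty, collapsed:", gaussianity_penalty(collapsed))
    # A collapsed encoder makes prediction trivially perfect, so only the
    # penalty keeps the total loss away from zero.
    print("total, collapsed:  ", total_loss(collapsed, actions, collapsed,
                                            lambda z, a: z))
```

The collapsed case shows why the second term is needed: constant latents drive the prediction term to zero, so without a distributional penalty the degenerate solution would minimize the objective.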
Contribution. A genuinely stable, teacher-free recipe for action-conditioned latent world models from raw pixels, simplifying the JEPA-for-control pipeline while retaining plannable latent dynamics.
Significance. For world modeling, eliminating the EMA teacher removes a major source of instability and makes end-to-end learning of controllable latent dynamics practical, lowering the barrier to model-based planning. For a biological or drug-discovery world model, teacher-free stability matters because perturbation datasets are small and noisy. A SIGReg-regularized objective can learn action-conditioned latent dynamics over cellular states (with interventions as actions) without the delicate teacher tuning that often fails in low-data scientific regimes, while keeping the latent well-conditioned for downstream planning.