Klindt, LeCun & Balestriero give the cleanest theoretical statement so far of when a JEPA stops being a representation learner and starts being a world model.
The result
Under (i) independent Gaussian latent variables, (ii) stationary additive-noise transitions, and (iii) successful Gaussian regularization (SIGReg, from LeJEPA), a JEPA trained with alignment recovers the world's latent variables linearly, up to rotation. Latent prediction is enough to identify the generative factors — and the paper connects this to optimal planning in the recovered latent space.
This is the formal license behind the whole "JEPA = world model" claim. If the embedding is faithful up to an invertible linear map (here, a rotation), then planning in latents is planning in the world.
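To make "recovers the latents linearly, up to rotation" concrete, here is a toy check in NumPy. It is a sketch under my own assumptions, not the paper's construction: the "learned" embedding `h` is simply the true latents `z` passed through a random orthogonal matrix `q`, and all names are illustrative. If the embedding is faithful in this sense, a plain linear probe recovers every latent with R^2 near 1, and the probe matrix is itself orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 4

# Assumed ground truth: independent standard-Gaussian latents.
z = rng.standard_normal((n, d))

# Hypothetical "learned" embedding: the latents under an unknown
# orthogonal map (a rotation, possibly combined with a reflection).
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
h = z @ q

# Linear probe: least-squares map from the embedding back to the latents.
w, *_ = np.linalg.lstsq(h, z, rcond=None)
z_hat = h @ w

# R^2 per latent dimension; values near 1 mean linear recovery.
ss_res = ((z - z_hat) ** 2).sum(axis=0)
ss_tot = ((z - z.mean(axis=0)) ** 2).sum(axis=0)
r2 = 1.0 - ss_res / ss_tot

# If recovery is "up to rotation", the probe should be orthogonal.
orth_err = np.abs(w.T @ w - np.eye(d)).max()

print("R^2 per latent:", np.round(r2, 4))
print("max |W^T W - I|:", orth_err)
```

On real data the probe target would be whatever known factors are available, and an R^2 well below 1 or a strongly non-orthogonal probe would suggest the embedding is not faithful in the theorem's sense.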
Why I am cautious for biology
Each assumption, stated or implicit, is exactly where a cell pushes back:
- Independent Gaussian latents — gene programs are correlated, multimodal, and heavy-tailed.
- Stationary additive-noise transitions — perturbation responses are state-dependent, saturating, and often multiplicative; dose and time bend the dynamics.
- Single clean latent space — the interesting biology lives across modalities (DNA, RNA, protein, chemistry), which is a fusion problem the theorem does not address.
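The first objection above is cheap to check before trusting the theory on a given dataset. The sketch below computes two rough diagnostics on a matrix of factor or program scores: the largest off-diagonal correlation (a proxy for non-independence) and per-column excess kurtosis (a proxy for heavy tails). The function name, the input layout, and the synthetic Student-t data are my own illustrative assumptions, not real expression data; it also assumes no column is constant.

```python
import numpy as np

def assumption_diagnostics(x):
    """Rough checks on an (n_samples, n_factors) score matrix (hypothetical input)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=0)
    zs = xc / xc.std(axis=0)  # standardize each column
    corr = (zs.T @ zs) / x.shape[0]
    off_diag = corr - np.diag(np.diag(corr))
    # Excess kurtosis is 0 for a Gaussian and positive for heavy tails.
    excess_kurtosis = (zs ** 4).mean(axis=0) - 3.0
    return {
        "max_abs_offdiag_corr": float(np.abs(off_diag).max()),
        "excess_kurtosis": excess_kurtosis,
    }

rng = np.random.default_rng(0)

# Case 1: data that satisfies the independent-Gaussian assumption.
gauss = rng.standard_normal((5000, 3))

# Case 2: synthetic stand-in for correlated, heavy-tailed programs:
# a shared Student-t driver plus smaller independent Student-t parts.
shared = rng.standard_t(df=5, size=(5000, 1))
heavy = shared + 0.5 * rng.standard_t(df=5, size=(5000, 3))

diag_gauss = assumption_diagnostics(gauss)
diag_heavy = assumption_diagnostics(heavy)
print("gaussian:", diag_gauss)
print("heavy-tailed, correlated:", diag_heavy)
```

These two numbers say nothing about multimodality or about the transition assumptions, so they are a first screen rather than a test of the theorem's conditions.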
What I take from it
Use the theory as design guidance, not a guarantee. Concretely: prefer SIGReg-style isotropic-Gaussian regularization as the default anti-collapse regularizer; probe whether the learned latent dimensions actually align with known biological factors; and use real perturbation and temporal data to break the non-identifiability the theory warns about. The math tells you what a faithful latent looks like; the wet-lab data tells you whether you got one.
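For intuition on the first recommendation, here is a simplified stand-in for a sliced isotropic-Gaussian penalty. It is not the SIGReg implementation from LeJEPA: it projects embeddings onto random unit directions and compares the empirical characteristic function of each projection with that of a standard Gaussian on a fixed grid. The function name, the grid, and the number of directions are my own assumptions, and a training version would need to live in a differentiable framework rather than NumPy.

```python
import numpy as np

def sliced_gaussian_penalty(h, n_dirs=64, t_grid=None, seed=0):
    """Mean squared gap between projected embeddings and N(0, 1),
    measured through characteristic functions (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    h = np.asarray(h, dtype=float)
    d = h.shape[1]
    if t_grid is None:
        t_grid = np.linspace(-3.0, 3.0, 13)

    # Random unit directions; each defines one 1-D "slice" of the embedding.
    dirs = rng.standard_normal((d, n_dirs))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = h @ dirs  # shape (n, n_dirs)

    # Empirical characteristic function E[exp(i t x)] per direction.
    phase = proj[:, :, None] * t_grid  # shape (n, n_dirs, n_t)
    ecf_real = np.cos(phase).mean(axis=0)
    ecf_imag = np.sin(phase).mean(axis=0)

    # The standard Gaussian has characteristic function exp(-t^2 / 2).
    target = np.exp(-0.5 * t_grid ** 2)
    return float(((ecf_real - target) ** 2 + ecf_imag ** 2).mean())

rng = np.random.default_rng(1)
healthy = rng.standard_normal((2000, 8))            # isotropic Gaussian
collapsed = 0.01 * rng.standard_normal((2000, 8))   # near-constant embedding

p_healthy = sliced_gaussian_penalty(healthy)
p_collapsed = sliced_gaussian_penalty(collapsed)
print("healthy:", p_healthy, "collapsed:", p_collapsed)
```

The penalty is close to zero for the isotropic Gaussian batch and much larger for the collapsed one, which is the behavior an anti-collapse term needs. The same function can be reused as a monitoring metric on held-out embeddings even when it is not part of the loss.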