Motivation
I-JEPA predicts the embeddings of masked image blocks from a visible context, which amounts to inpainting in latent space. IWM (Garrido et al., 2024) argues that a genuine world model should do more: it should capture how the world changes under interventions. The paper generalises I-JEPA from masked prediction to predicting the effect of image transformations in representation space, so the predictor becomes a model of how images transform rather than merely a tool for filling in missing patches.
How it works
IWM keeps the JEPA backbone: a context encoder, an EMA target encoder that produces the prediction targets, and a predictor operating in latent space. The key change is that the predictor is conditioned on the transformation, which combines masking with photometric and geometric augmentations such as colour jitter, blur or cropping. Given the embedding of a source image and a description of the transformation, the predictor must output the target encoder's embedding of the transformed image. As in I-JEPA, collapse is avoided through the EMA target and the encoder-predictor asymmetry; only the conditioning and the choice of transformations are new.
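The EMA target encoder mentioned above can be illustrated with a minimal sketch. It assumes the standard exponential-moving-average rule $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$; the function name `ema_update`, the dictionary-of-lists parameter layout and the momentum value are hypothetical choices for illustration, not details taken from the paper.

```python
def ema_update(target_params, online_params, momentum=0.996):
    """Return new target parameters: momentum * target + (1 - momentum) * online.

    Parameters are stored as {name: list of floats}. This layout and the
    default momentum are illustrative assumptions, not values from the paper.
    """
    return {
        name: [
            momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params[name], online_params[name])
        ]
        for name in target_params
    }


# Toy parameters for the context (online) encoder and the target encoder.
online = {"w": [1.0, 2.0], "b": [0.5]}
target = {"w": [0.0, 0.0], "b": [0.0]}

# One update step: the target moves 10% of the way towards the online weights.
target = ema_update(target, online, momentum=0.9)
print(target)
```

The target encoder receives no gradients; it only drifts slowly towards the context encoder, which is what makes its embeddings a stable prediction target.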
The objective
The predictor $g_\phi$ is trained to map a source embedding and a transformation code $t$ to the target embedding of the transformed image:
$$\mathcal{L} = \big\lVert g_\phi\big(f_\theta(x),\,t\big) - \operatorname{sg}\big[f_{\bar\theta}(t(x))\big] \big\rVert^2$$
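where $f_\theta$ is the context encoder, $f_{\bar\theta}$ is the EMA target encoder, $\operatorname{sg}[\cdot]$ is the stop-gradient operator, and $t$ denotes both the applied transformation and, as the predictor's second argument, the code describing its parameters. The capacity of the predictor controls the outcome: a high-capacity predictor can model the transformation itself, so the encoder is free to keep transformation information and its representations become equivariant, while a weak predictor forces the encoder to discard that information, yielding more invariant features.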
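The forward computation of this objective can be sketched with toy linear maps. Everything here is an assumption made for illustration: the encoders and predictor are plain matrices rather than the networks used in the paper, the transformation is a hypothetical brightness scaling, its code is the single scaling factor, and the helper names (`matvec`, `apply_brightness`, `iwm_loss`) are invented. Plain Python has no automatic differentiation, so the stop-gradient is only indicated in a comment.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def apply_brightness(x, factor):
    """Hypothetical transformation t: scale intensities, clipped to [0, 1]."""
    return [min(1.0, max(0.0, factor * p)) for p in x]


def iwm_loss(x, factor, W_ctx, W_tgt, W_pred):
    """Squared error between the predicted and the target embedding."""
    z_src = matvec(W_ctx, x)                   # f_theta(x)
    t_code = [factor]                          # transformation code t
    pred = matvec(W_pred, z_src + t_code)      # g_phi(f_theta(x), t)
    # Target embedding f_theta_bar(t(x)); in training it is treated as a
    # constant (stop-gradient), so no gradient flows into W_tgt.
    target = matvec(W_tgt, apply_brightness(x, factor))
    return sum((p - y) ** 2 for p, y in zip(pred, target))


rng = random.Random(0)
dim_in, dim_z = 8, 4
W_ctx = random_matrix(dim_z, dim_in, rng)
W_tgt = [row[:] for row in W_ctx]              # target starts as a copy
W_pred = random_matrix(dim_z, dim_z + 1, rng)  # +1 input for the code

x = [rng.random() for _ in range(dim_in)]      # toy "image" in [0, 1)
loss = iwm_loss(x, 1.3, W_ctx, W_tgt, W_pred)
print(loss)
```

In this sketch the predictor sees the transformation code by concatenation with the source embedding; that is one simple conditioning choice, not necessarily the mechanism used in IWM.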
Key results & what's novel
IWM shows that the quality of the world model directly governs representation properties: a capable predictor yields equivariant representations and improves downstream performance, whereas a weak predictor pushes invariance into the encoder. Crucially, the learned predictor is not discarded after pretraining: it can be reused and fine-tuned for downstream tasks, which treats the world model as a transferable module. This reframes self-supervised visual learning as learning a controllable, transformation-conditioned predictive model, generalising I-JEPA beyond inpainting.
Strengths & limitations
- + Generalises I-JEPA from masking to parameterised image transformations, bridging SSL and intervention-conditioned world models.
- + Reusable predictor; tunable equivariance-vs-invariance via predictor capacity.
- + Clarifies that world-model quality shapes representation quality.
- − Limited to known, parameterisable image transformations rather than real-world actions.
- − Still depends on an EMA target and stop-gradient for stability.