Image World Models (IWM) — World Modeling

At a glance

Problem: I-JEPA only inpaints masked content; it does not model how an image changes under explicit transformations or interventions.

Key idea: Condition the latent predictor on the transformation, turning the JEPA into an Image World Model that predicts transformed-image embeddings.

Modality: Images

Target / masking: Block masking plus photometric/geometric augmentations; predict the EMA target encoder's embedding of the transformed image.

Builds on: I-JEPA's context encoder, EMA target encoder and latent predictor.

Used for: Learning transformation-equivariant features and a reusable, controllable world-model predictor.

Motivation

I-JEPA predicts the embeddings of masked image blocks from a visible context — effectively latent inpainting. IWM (Garrido et al., 2024) argues a genuine world model should do more: it should understand how the world changes under interventions. The paper generalises I-JEPA from masked prediction to predicting the effect of arbitrary transformations in representation space, so the predictor becomes a model of how images transform rather than merely a tool for filling in missing patches.

How it works

Canonical JEPA schematic for images. The input is split into a visible context and hidden targets (token-level blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $f_{\bar\theta}$ (an EMA copy with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimises the distance between prediction and target in latent space. In the world-model variant the predictor is also action-conditioned, $\hat z_{t+1}=g_\phi(z_t,a_t)$, where the subscript indexes the state and, for IWM, the action $a_t$ is the transformation applied to the image. This conditioning is what turns a representation learner into a world model.

IWM keeps the JEPA backbone: a context encoder, an EMA target encoder that produces the prediction targets, and a predictor operating in latent space. The key change is that the predictor is conditioned on the transformation — masking together with photometric and geometric augmentations such as colour jitter, blur or cropping. Given the embedding of a source image and a description of the transformation, the predictor must output the target encoder's embedding of the transformed image. Collapse is avoided via the EMA target and the encoder-predictor asymmetry, in the I-JEPA tradition; only the conditioning and the choice of transformations are new.
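To make the conditioning step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's architecture: the parameter names (`brightness`, `blur_sigma`, `crop_box`), the helper names and the choice to condition by concatenation are all hypothetical, and a toy linear layer stands in for the real predictor network.

```python
import random

def encode_transformation(brightness, blur_sigma, crop_box):
    """Pack hypothetical augmentation parameters into a flat code vector."""
    x0, y0, x1, y1 = crop_box
    return [brightness, blur_sigma, x0, y0, x1, y1]

def linear(weights, bias, inputs):
    """Plain dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def make_predictor(embed_dim, code_dim, seed=0):
    """Build a toy predictor g(z, t) with fixed random weights (assumption)."""
    rng = random.Random(seed)
    in_dim = embed_dim + code_dim
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(embed_dim)]
    bias = [0.0] * embed_dim

    def predict(source_embedding, transform_code):
        # Conditioning by concatenation: the predictor sees both the
        # source embedding and the description of the transformation.
        return linear(weights, bias,
                      list(source_embedding) + list(transform_code))

    return predict

predictor = make_predictor(embed_dim=4, code_dim=6)
z_source = [0.5, -1.0, 0.25, 2.0]  # stand-in for the context encoder output

code_a = encode_transformation(0.2, 1.5, (0.1, 0.1, 0.9, 0.9))
code_b = encode_transformation(-0.3, 0.0, (0.0, 0.0, 1.0, 1.0))

# Same source embedding, different transformations, different predictions.
pred_a = predictor(z_source, code_a)
pred_b = predictor(z_source, code_b)
```

The point of the sketch is only that the same source embedding yields different predicted embeddings once the transformation code changes. How the condition is actually injected into the predictor is a design choice this summary does not specify.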

The objective

The predictor $g_\phi$ is trained to map a source embedding and a transformation code $t$ to the target embedding of the transformed image:

$$\mathcal{L} = \big\lVert g_\phi\big(f_\theta(x),\,t\big) - \operatorname{sg}\big[f_{\bar\theta}(t(x))\big] \big\rVert^2$$

where $f_\theta$ is the context encoder, $f_{\bar\theta}$ the EMA target encoder, $\operatorname{sg}[\cdot]$ the stop-gradient operator (no gradient flows into the target branch) and $t$ the applied transformation. The capacity of the predictor controls the outcome: with a strong predictor the encoder can keep transformation information and its representations become transformation-equivariant, while a weak predictor forces the encoder to discard that information, yielding more invariant features.
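The two numerical ingredients of this objective, the squared latent distance and the EMA update of the target encoder, can be sketched in a few lines of plain Python. This is a toy illustration: the vectors stand in for real embeddings and parameters, the function names are hypothetical, and the momentum value is chosen for easy arithmetic rather than taken from the paper.

```python
def squared_l2(prediction, target):
    """Squared Euclidean distance between predicted and target embeddings.

    The target is treated as a constant here; in an autodiff framework it
    would be wrapped in a stop-gradient so no gradient reaches the target
    encoder.
    """
    return sum((p - t) ** 2 for p, t in zip(prediction, target))

def ema_update(target_params, online_params, momentum):
    """Move the target parameters a small step towards the online ones."""
    return [momentum * tp + (1.0 - momentum) * op
            for tp, op in zip(target_params, online_params)]

# Toy embeddings: g(f(x), t) versus the target embedding of t(x).
prediction = [1.0, 2.0, 3.0]
target = [1.0, 0.0, 5.0]
loss = squared_l2(prediction, target)  # 0 + 4 + 4 = 8.0

# Toy parameters; momentum=0.5 is illustrative only.
online = [1.0, 1.0]
target_weights = ema_update([0.0, 2.0], online, momentum=0.5)  # [0.5, 1.5]
```

In practice the momentum is usually set close to 1 so that the target encoder changes slowly, which is part of what keeps the targets stable during training.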

Key results & what's novel

IWM shows that the quality of the world model directly governs the properties of the learned representation: a high-capacity predictor produces equivariant representations and improves downstream performance, whereas a weak predictor pushes invariance into the encoder. Crucially, the learned predictor is not discarded after pretraining: it can be reused and fine-tuned for downstream tasks, treating the world model as a transferable module. This reframes self-supervised visual learning as learning a controllable, transformation-conditioned predictive model, generalising I-JEPA beyond inpainting.
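As a reading aid, the two regimes can be written with the symbols of the objective. These are the usual informal definitions, given here under the simplifying assumption that the online and EMA encoders agree, and are not formulas quoted from the paper:

$$\text{invariant:}\quad f_\theta\big(t(x)\big) \approx f_\theta(x) \quad \text{for every transformation } t$$

$$\text{equivariant:}\quad f_\theta\big(t(x)\big) \approx g_\phi\big(f_\theta(x),\,t\big), \quad \text{with } g_\phi\big(f_\theta(x),\,t\big) \text{ varying with } t$$

In the invariant case the embedding carries no trace of $t$, so the predictor has nothing to model; in the equivariant case the embedding changes with $t$ in a way the predictor can reproduce.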

Strengths & limitations

+ Generalises I-JEPA to arbitrary transformations, bridging SSL and intervention-conditioned world models.
+ Reusable predictor; tunable equivariance-vs-invariance via predictor capacity.
+ Clarifies that world-model quality shapes representation quality.
− Limited to known, parameterisable image transformations rather than real-world actions.
− Still depends on EMA target and stop-gradient for stability.

Connections & references

Builds on: I-JEPA, V-JEPA

Paper ↗