Motivation
Two dominant self-supervised recipes each pay a tax. Invariance-based methods (SimCLR, BYOL, DINO) learn semantics but depend on a hand-engineered set of augmentations — crops, color jitter, blur — that encode human priors about what should be invariant; those priors do not transfer cleanly to non-natural images. Reconstruction-based methods (MAE) need no augmentations but predict in pixel space, spending capacity on high-frequency detail that is irrelevant to semantics. I-JEPA asks for the best of both: no augmentations, and a target that is abstract rather than pixel-level.
How it works
An image is split into a regular grid of patches. I-JEPA samples one large context block and several smaller target blocks, then removes from the context any patches that overlap a target block, so that the task cannot be solved by simple copying.
- The context encoder $f_\theta$ — a ViT — embeds only the visible context patches into tokens $z_{\text{ctx}}$.
- The target encoder $\bar f_\theta$ is an exponential-moving-average copy of $f_\theta$, with weights $\bar\theta$; it embeds the full image, and its outputs at the target-block positions become the prediction targets. No gradient flows through this branch.
- The predictor $g_\phi$, a narrow ViT, takes the context tokens plus learnable mask tokens carrying the positional information of each target block, and predicts the target representations.
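Because the targets are computed from the full image, each target embedding carries context from the rest of the scene. Predicting it therefore requires modelling the relationships between parts of a scene rather than local texture, and this is the pressure that pushes the encoder toward semantic features.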
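The block sampling described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the reference implementation: the helper names `sample_block` and `sample_masks`, the 14×14 patch grid, the number of target blocks, and the scale and aspect-ratio ranges are all assumptions chosen for the example.

```python
import math
import numpy as np

def sample_block(rng, grid, scale_range, aspect_range):
    """Return flat patch indices of one random rectangular block on a grid x grid layout."""
    scale = rng.uniform(*scale_range)      # fraction of the image the block covers
    aspect = rng.uniform(*aspect_range)    # height / width
    area = scale * grid * grid
    h = min(grid, max(1, round(math.sqrt(area * aspect))))
    w = min(grid, max(1, round(math.sqrt(area / aspect))))
    top = int(rng.integers(0, grid - h + 1))    # upper bound is exclusive
    left = int(rng.integers(0, grid - w + 1))
    rows, cols = np.meshgrid(np.arange(top, top + h),
                             np.arange(left, left + w), indexing="ij")
    return (rows * grid + cols).ravel()

def sample_masks(rng, grid=14, num_targets=4):
    """Sample several small target blocks and one large context block (assumed ranges)."""
    targets = [sample_block(rng, grid, (0.15, 0.2), (0.75, 1.5))
               for _ in range(num_targets)]
    context = sample_block(rng, grid, (0.85, 1.0), (1.0, 1.0))
    # Drop every context patch that falls inside a target block, so the
    # predictor cannot simply copy visible tokens.
    context = np.setdiff1d(context, np.concatenate(targets))
    return context, targets

rng = np.random.default_rng(0)
context, targets = sample_masks(rng)
```

Here `context` holds the patch indices the context encoder would see, and each entry of `targets` holds the indices whose target-encoder embeddings the predictor has to match.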
The objective
For target blocks $k=1\dots M$, the loss is the average squared $\ell_2$ distance in latent space between predicted and target representations:
$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2$$
where $\operatorname{sg}$ is stop-gradient and $m_k$ is the mask token for block $k$. The target-encoder weights $\bar\theta$ are updated by EMA, $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, with a momentum $\tau$ close to 1. Together with the predictor and multi-block masking, this slowly moving target is what keeps the representations from collapsing.
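The objective and the EMA update above can be written out as a small NumPy sketch. The function names `latent_loss` and `ema_update`, the dictionary layout for parameters, and the default `tau=0.996` are assumptions made for illustration. NumPy has no autodiff, so the stop-gradient is only implicit here: the targets are treated as constants.

```python
import numpy as np

def latent_loss(pred_blocks, target_blocks):
    """Average over target blocks of the squared L2 distance in latent space.

    Each list element is an array of shape (patches_in_block, dim).
    Targets are treated as constants, which stands in for stop-gradient.
    """
    per_block = [np.sum((p - t) ** 2) for p, t in zip(pred_blocks, target_blocks)]
    return float(np.mean(per_block))

def ema_update(target_params, online_params, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, per named parameter."""
    return {name: tau * target_params[name] + (1.0 - tau) * online_params[name]
            for name in target_params}

# Two target blocks: the first is off by 1 in all 6 entries, the second is exact.
pred = [np.ones((3, 2)), np.zeros((2, 2))]
targ = [np.zeros((3, 2)), np.zeros((2, 2))]
loss = latent_loss(pred, targ)  # (6 + 0) / 2 = 3.0

# One EMA step with tau = 0.9 moves the target weights 10% of the way to the online weights.
new_target = ema_update({"w": np.zeros(2)}, {"w": np.ones(2)}, tau=0.9)
```

In a real training loop only the context encoder and predictor would receive gradients from `loss`; the target encoder changes solely through `ema_update`.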
Key results & what's novel
I-JEPA is the first instantiation of JEPA for images and the template that later members of the family follow. With ViT-Huge pretrained on ImageNet it learns semantically rich features without any augmentation, is competitive with view-invariance methods on linear probing, and is markedly more compute-efficient to pretrain than pixel-reconstruction approaches at comparable quality. Its representations are label-efficient in the low-shot regime and transfer across tasks. The novelty is conceptual: prediction of abstract targets in latent space is a sufficient learning signal, with no contrastive negatives, no augmentation engineering, and no pixel decoder.
Strengths & limitations
- + No hand-crafted augmentations; largely domain-agnostic recipe.
- + Efficient pretraining; strong off-the-shelf and low-shot features.
- − The masking design (block scale and count) matters and needs tuning.
- − The predictor regresses an expected target, which can wash out fine multimodal detail; I-JEPA is not generative.
- − It learns a static representation — there is no notion of dynamics or action, so it is a representation learner, not yet a world model.
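The "expected target" limitation listed above follows from a standard property of squared-error regression. As a sketch, write $s$ for a target representation and $c$ for the context (both symbols are introduced here only for illustration), and assume an unconstrained predictor. The prediction that minimises the loss is then the conditional mean:

$$\hat s^{\star}(c) = \arg\min_{\hat s}\ \mathbb{E}\big[\lVert \hat s - s \rVert_2^2 \,\big|\, c\big] = \mathbb{E}\big[s \mid c\big]$$

For example, if two completions $s = +a$ and $s = -a$ are equally likely given $c$, the minimiser is $\hat s^{\star} = 0$, which matches neither mode. This is the sense in which fine multimodal detail can be averaged away.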