At a glance
Problem: Learn strong image representations without hand-crafted augmentations and without wasting capacity reconstructing pixels.
Key idea: Predict the representations of masked target blocks from a single visible context block, entirely in latent space.
Modality: Images (Vision Transformer)
Target / masking: Multi-block: ~4 target blocks predicted from one large context block, at the level of ViT patches.
Builds on: LeCun's energy-based, prediction-in-latent-space programme; EMA self-distillation (BYOL/DINO lineage).
Used for: General-purpose visual backbones, linear-probe and low-shot classification, transfer.

Motivation

Two dominant self-supervised recipes each pay a tax. Invariance-based methods (SimCLR, BYOL, DINO) learn semantics but depend on a hand-engineered set of augmentations — crops, color jitter, blur — that encode human priors about what should be invariant; those priors do not transfer cleanly to non-natural images. Reconstruction-based methods (MAE) need no augmentations but predict in pixel space, spending capacity on high-frequency detail that is irrelevant to semantics. I-JEPA asks for the best of both: no augmentations, and a target that is abstract rather than pixel-level.

How it works

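[Figure: image patches (multi-block masking) feed the context encoder $f_\theta$ and the target encoder $\bar f_\theta$ (EMA copy); the predictor $g_\phi$ maps $z_{\text{ctx}}$ to $\hat z$, and the latent loss is $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.]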
Canonical JEPA schematic, instantiated for images. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embeddings to predicted target embeddings; training minimises the latent distance between prediction and target.

An image is split into a regular grid of patches. I-JEPA samples one large context block and several smaller target blocks, then removes from the context every patch that overlaps a target block, so the task cannot be solved by simple copying.
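The sampling step can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function names `sample_block` and `sample_masks`, the 14×14 grid, and the scale and aspect-ratio ranges are all assumptions chosen for the example rather than values stated in the text above.

```python
import math
import random

def sample_block(grid, scale, aspect, rng):
    """Sample one rectangular block of patch indices on a grid x grid layout."""
    area = rng.uniform(*scale) * grid * grid      # block area in patches
    ratio = rng.uniform(*aspect)                  # height / width
    h = max(1, min(grid, round(math.sqrt(area * ratio))))
    w = max(1, min(grid, round(math.sqrt(area / ratio))))
    top = rng.randint(0, grid - h)                # inclusive bounds
    left = rng.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}

def sample_masks(grid=14, num_targets=4, seed=0):
    """Return (context indices, list of target-block indices) for one image."""
    rng = random.Random(seed)
    # Assumed ranges: small target blocks, one large square-ish context block.
    targets = [sample_block(grid, (0.15, 0.20), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    # Drop every target patch from the context so copying is impossible.
    context -= set().union(*targets)
    return sorted(context), [sorted(t) for t in targets]

context_idx, target_idx = sample_masks()
```

Target blocks are allowed to overlap one another here; only the context is forced to be disjoint from them, which is the property the paragraph above relies on.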

  • The context encoder $f_\theta$ — a ViT — embeds only the visible context patches into tokens $z_{\text{ctx}}$.
  • The target encoder $\bar f_\theta$ is an exponential-moving-average copy of $f_\theta$; it embeds the full image, and the embeddings at the target-block positions become the prediction targets (gradients are stopped here).
  • The predictor $g_\phi$, a narrow ViT, takes the context tokens plus learnable mask tokens carrying the positional information of each target block, and predicts the target representations.
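The data flow through the three modules can be made concrete with a shape-level sketch. Everything here is a toy stand-in under stated assumptions: `encode` is a per-patch linear map with no attention (a real ViT mixes tokens), the predictor is replaced by a trivial mean-pooling rule, and the sizes and index ranges are hypothetical. The point is only which patches each module sees.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, D = 196, 768, 64                  # patches, flattened patch size, embed dim (illustrative)

patches = rng.normal(size=(N, P))       # one image as N flattened patches
pos = rng.normal(size=(N, D))           # positional embedding per patch position
W_ctx = rng.normal(size=(P, D)) * 0.02  # stand-in for context-encoder weights
W_tgt = W_ctx.copy()                    # target encoder starts as a copy (later updated by EMA)
mask_token = np.zeros(D)                # shared learnable mask token

def encode(x, idx, W):
    """Toy per-patch encoder (no attention): project, then add position."""
    return x[idx] @ W + pos[idx]

ctx_idx = np.arange(0, 120)             # hypothetical visible context patches
tgt_idx = np.arange(150, 170)           # hypothetical target block, disjoint from the context

z_ctx = encode(patches, ctx_idx, W_ctx)        # context encoder sees only visible patches
z_full = encode(patches, np.arange(N), W_tgt)  # target encoder sees the full image
z_tgt = z_full[tgt_idx]                        # targets are read off at target positions

queries = mask_token + pos[tgt_idx]            # mask tokens carry the target positions
z_pred = queries + z_ctx.mean(axis=0)          # toy predictor, NOT a ViT

print(z_ctx.shape, z_tgt.shape, z_pred.shape)
```

Note that the predictor never receives the target patches themselves, only their positions; any information about their content has to come from the context tokens.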

Because the targets are representations of full image regions, solving the task requires modelling the relationships between parts of a scene — which is exactly what yields semantic features.

The objective

For target blocks $k=1\dots M$, the loss is the latent $\ell_2$ distance between predicted and target representations:

$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2$$

where $\operatorname{sg}$ is stop-gradient and $m_k$ is the mask token for block $k$. The target encoder is updated by EMA, $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, which (together with the predictor and multi-block masking) prevents representational collapse.
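As a rough sketch of the two update rules just described, the loss and the EMA step can be written in NumPy. The helper names `latent_loss` and `ema_update` and the default `tau=0.996` are assumptions for illustration. NumPy has no autodiff, so the stop-gradient is implicit here: the targets are plain constants.

```python
import numpy as np

def latent_loss(preds, targets):
    """Mean over blocks of the squared L2 distance between prediction and target.

    preds[k] and targets[k] are arrays of shape (patches_in_block_k, dim).
    In an autodiff framework the targets would be wrapped in a stop-gradient.
    """
    per_block = [np.sum((p - t) ** 2) for p, t in zip(preds, targets)]
    return float(np.mean(per_block))

def ema_update(target_params, online_params, tau=0.996):
    """In place: theta_bar <- tau * theta_bar + (1 - tau) * theta."""
    for name, theta in online_params.items():
        target_params[name] = tau * target_params[name] + (1.0 - tau) * theta

# Tiny usage example with hypothetical values.
preds = [np.ones((2, 3)), np.zeros((1, 3))]
targets = [np.zeros((2, 3)), np.zeros((1, 3))]
loss = latent_loss(preds, targets)      # (6 + 0) / 2 = 3.0

target_params = {"w": np.zeros(2)}
online_params = {"w": np.ones(2)}
ema_update(target_params, online_params, tau=0.9)   # w becomes 0.1 everywhere
```

With $\tau$ close to 1 the target encoder changes slowly, so the prediction targets stay stable from one step to the next while the context encoder and predictor are trained by gradient descent.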

Key results & what's novel

I-JEPA is the first JEPA for images and the template the whole family copies. With ViT-Huge pretrained on ImageNet it learns semantically rich features without any augmentation, is competitive with view-invariance methods on linear probing, and is markedly more compute-efficient to pretrain than pixel-reconstruction approaches at comparable quality. Its representations are label-efficient in the low-shot regime and transfer across tasks. The novelty is conceptual: prediction of abstract targets in latent space is a sufficient learning signal — no contrastive negatives, no augmentation engineering, no pixel decoder.

Strengths & limitations

  • + No hand-crafted augmentations; largely domain-agnostic recipe.
  • + Efficient pretraining; strong off-the-shelf and low-shot features.
  • − The masking design (block scale and count) matters and needs tuning.
  • − The predictor regresses an expected target, which can wash out fine multimodal detail; I-JEPA is not generative.
  • − It learns a static representation with no notion of dynamics or action, so it is a representation learner, not yet a world model.

Connections & references