RadJEPA — World Modeling

At a glance

Problem: Expert-annotated chest radiographs are limited relative to the huge volume of unlabelled scans, and pixel-reconstruction objectives squander capacity on imaging texture and noise rather than diagnostic structure.

Key idea: Apply the I-JEPA recipe to chest X-rays: predict the latent embeddings of masked image blocks, forcing prediction of anatomical and pathological structure in representation space.

Modality: Chest X-ray

Target / masking: Mask spatial image blocks; an EMA target encoder supplies stop-gradient latent targets.

Builds on: I-JEPA's masked latent block prediction.

Used for: A transferable chest-radiograph encoder learned without dense labels.

Motivation

Chest radiographs (X-rays) are a workhorse of clinical imaging, but expert-annotated images are scarce relative to the huge volume of unlabelled scans. Pixel-reconstruction objectives waste model capacity on imaging texture and noise rather than diagnostic structure, and natural-image augmentations can destroy clinically meaningful signal. RadJEPA (Khan et al., 2026) aims to train a strong self-supervised chest X-ray encoder from abundant unlabelled scans, without dense labels.

How it works

Canonical JEPA schematic for chest X-ray. The input is split into a visible context and hidden targets (image block-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the latent distance between predictions and targets.

RadJEPA applies the I-JEPA-style recipe to radiographs.

A context encoder embeds a visible portion of the image.
A predictor predicts the latent embeddings of masked image blocks.
An EMA target encoder supplies stop-gradient targets for the latent prediction loss.
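The EMA relationship between the two encoders can be illustrated with a minimal NumPy sketch. It assumes the encoder weights are held in plain dictionaries; the function name `ema_update`, the parameter names and the momentum value are hypothetical placeholders, not settings reported for RadJEPA.

```python
import numpy as np

def ema_update(target_params, context_params, momentum=0.996):
    """Move each target-encoder parameter a small step towards the context encoder.

    The momentum default is an illustrative assumption, not a RadJEPA setting.
    """
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * context_params[name]
        for name in target_params
    }

# Toy weights standing in for real encoder parameters (hypothetical names).
context = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target = {"w": np.zeros((2, 2)), "b": np.ones(2)}

# With momentum 0.9 the target keeps 90% of its old value per update.
target = ema_update(target, context, momentum=0.9)
```

The target encoder receives no gradients; it only tracks the context encoder through this running average, which is what keeps its outputs usable as slowly moving targets.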

The masking unit is spatial image blocks, forcing prediction of anatomical and pathological structure in representation space rather than interpolation of pixel texture. The augmentation-free latent objective is well suited to medical imaging, where the crops, jitter and blur used for natural images can destroy diagnostically relevant signal.
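A rough sketch of block masking on a patch grid is given here. The grid size, block size and number of target blocks are illustrative assumptions, and treating every patch outside the target blocks as context is a simplification; the source does not state RadJEPA's actual masking hyperparameters.

```python
import numpy as np

def sample_block(grid_h, grid_w, block_h, block_w, rng):
    """Return a boolean (grid_h, grid_w) mask with one rectangular block set to True."""
    top = rng.integers(0, grid_h - block_h + 1)
    left = rng.integers(0, grid_w - block_w + 1)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[top:top + block_h, left:left + block_w] = True
    return mask

def sample_masks(grid_h=14, grid_w=14, num_targets=4, block_h=4, block_w=4, seed=0):
    """Sample target blocks and derive the visible context as everything left over.

    All default values are placeholders chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    targets = [sample_block(grid_h, grid_w, block_h, block_w, rng)
               for _ in range(num_targets)]
    hidden = np.any(targets, axis=0)   # union of all target blocks
    context = ~hidden                  # patches the context encoder may see
    return context, targets

context_mask, target_masks = sample_masks()
```

Because the context never overlaps a target block, the predictor cannot copy target content and has to infer it from the surrounding anatomy.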

The objective

The loss is the latent distance over masked radiograph blocks:

$$\mathcal{L} = \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\bar f_\theta(x)_k]\,\big\rVert_2^2,$$

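Here $z_{\text{ctx}}$ is the context encoder's embedding of the visible region, $m_k$ tells the predictor $g_\phi$ which masked block $k$ to predict, $\operatorname{sg}$ is the stop-gradient operator, and $\bar f_\theta(x)_k$ is the EMA target encoder's embedding of block $k$. There are no contrastive negatives and no augmentations; the block masking plus EMA target supply the learning signal, and latent targets keep the model from fitting irrelevant pixel texture.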
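As a numerical illustration of this loss, the NumPy sketch below sums squared L2 distances over masked positions. It assumes one embedding per patch and a boolean mask over patches; the function name `latent_block_loss` and the array shapes are hypothetical. NumPy has no gradients, so the stop-gradient is implicit here; in an autodiff framework the target would be detached explicitly.

```python
import numpy as np

def latent_block_loss(pred, target, mask):
    """Sum of squared L2 distances between predictions and targets at masked positions.

    pred, target: arrays of shape (num_patches, dim).
    mask: boolean array of shape (num_patches,), True where the patch is masked.
    The target is treated as a constant, mirroring the stop-gradient in the formula.
    """
    diff = pred[mask] - target[mask]
    return float(np.sum(diff ** 2))

# Toy example: 4 patches with 2-dimensional embeddings, 2 of them masked.
pred = np.zeros((4, 2))
target = np.ones((4, 2))
mask = np.array([True, False, True, False])

loss = latent_block_loss(pred, target, mask)  # 2 masked patches x 2 dims x 1.0 = 4.0
```

Only masked positions contribute, so visible patches place no direct constraint on the predictor's output.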

Key results & what's novel

The key contribution is a transferable chest-radiograph encoder learned without dense labels. The significance is domain fit: the augmentation-free, latent-target recipe is well matched to medical imaging precisely because natural-image augmentations are unsafe here, and pixel reconstruction wastes capacity on texture. By pretraining on abundant unlabelled radiographs, RadJEPA reduces the annotation burden that constrains imaging-encoder development for chest X-ray.

Strengths & limitations

+ Augmentation-free recipe avoids destroying clinically meaningful X-ray signal.
+ Latent block targets concentrate capacity on anatomy/pathology, not pixel texture.
+ Learns from abundant unlabelled radiographs, reducing the annotation burden.
− Block masking scale and count need tuning for radiographs.
− Static encoder; no notion of dynamics or longitudinal change.
− Downstream quality bounded by the diversity of the unlabelled pretraining set.

Connections & references

Builds on: I-JEPA
