Motivation
Chest radiographs (X-rays) are a workhorse of clinical imaging, but expert-annotated images are scarce relative to the huge volume of unlabelled scans. Pixel-reconstruction objectives waste model capacity on imaging texture and noise rather than diagnostic structure, and natural-image augmentations can destroy clinically meaningful signal. RadJEPA (Khan et al., 2026) aims to train a strong self-supervised chest X-ray encoder from abundant unlabelled scans, without dense labels.
How it works
RadJEPA applies an I-JEPA-style recipe to radiographs:
- A context encoder embeds a visible portion of the image.
- A predictor predicts the latent embeddings of masked image blocks.
- An EMA (exponential moving average) target encoder supplies stop-gradient targets for the latent prediction loss.
Masking operates on spatial image blocks, which forces the model to predict anatomical and pathological structure in representation space rather than interpolate pixel texture. The augmentation-free latent objective suits medical imaging, where the crops, jitter and blur used for natural images can destroy diagnostically relevant signal.
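As a rough illustration of block masking, the sketch below samples rectangular target blocks on a patch grid and keeps a context region that excludes them. Everything specific here is an assumption for illustration: the function names `sample_block` and `sample_masks`, the 14x14 grid, the four target blocks, and the scale and aspect-ratio ranges (chosen to resemble common I-JEPA-style settings) are not taken from RadJEPA.

```python
import random


def sample_block(grid_h, grid_w, scale, aspect, rng):
    """Sample one rectangular block of patch indices on a grid_h x grid_w grid.

    `scale` is the fraction of the grid the block should cover and
    `aspect` is its height/width ratio (both hypothetical parameters).
    """
    area = scale * grid_h * grid_w
    h = max(1, min(grid_h, round((area * aspect) ** 0.5)))
    w = max(1, min(grid_w, round((area / aspect) ** 0.5)))
    top = rng.randint(0, grid_h - h)    # inclusive bounds
    left = rng.randint(0, grid_w - w)
    return {r * grid_w + c
            for r in range(top, top + h)
            for c in range(left, left + w)}


def sample_masks(grid_h=14, grid_w=14, num_targets=4, seed=0):
    """Return (context_indices, list_of_target_index_lists).

    Assumed ranges: each target covers 15-20% of the grid, the context
    block covers 85-100%, and target patches are removed from the context
    so the encoder never sees what it must predict.
    """
    rng = random.Random(seed)
    targets = [
        sample_block(grid_h, grid_w,
                     rng.uniform(0.15, 0.2), rng.uniform(0.75, 1.5), rng)
        for _ in range(num_targets)
    ]
    masked = set().union(*targets)
    context_block = sample_block(grid_h, grid_w, rng.uniform(0.85, 1.0), 1.0, rng)
    context = context_block - masked
    return sorted(context), [sorted(t) for t in targets]


context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```

In a setup like this, the block scale and the number of target blocks are the knobs one would expect to tune for radiographs, where the structures of interest can be much smaller or larger than typical natural-image objects.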
The objective
The loss is the squared $\ell_2$ distance in latent space, summed over masked radiograph blocks:
$$\mathcal{L} = \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\bar f_\theta(x)_k]\,\big\rVert_2^2,$$
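with predictor $g_\phi$, stop-gradient $\operatorname{sg}$, and EMA target encoder $\bar f_\theta$. Following the I-JEPA convention, $z_{\text{ctx}}$ is presumably the context encoder's embedding of the visible region and $m_k$ the positional mask token for block $k$. There are no contrastive negatives and no augmentations; the block masking plus EMA target supply the learning signal, and latent targets keep the model from fitting irrelevant pixel texture.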
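To make the objective concrete, here is a minimal pure-Python sketch of the loss and of an EMA update for the target encoder. It is an illustration under stated assumptions, not RadJEPA's implementation: embeddings and parameters are plain lists of floats, the helper names `latent_loss` and `ema_update` are hypothetical, and the momentum value 0.996 is an illustrative choice. There is no autograd here, so the stop-gradient is represented simply by treating the targets as constants.

```python
def latent_loss(predictions, targets):
    """Sum over masked blocks of the squared L2 distance between the
    predicted embedding and the (constant, stop-gradient) target embedding."""
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total


def ema_update(target_params, context_params, momentum=0.996):
    """Move the target encoder's parameters toward the context encoder's.

    The target encoder receives no gradient; it only tracks an
    exponential moving average of the context encoder's weights.
    """
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]


# Two masked blocks with 2-dimensional embeddings (toy numbers).
preds = [[1.0, 2.0], [0.0, 0.0]]
tgts = [[0.0, 0.0], [3.0, 4.0]]
print(latent_loss(preds, tgts))                      # 1 + 4 + 9 + 16 = 30.0
print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5))  # [0.5, 0.5]
```

In an autograd framework, the same idea would typically compute the targets without tracking gradients, backpropagate the loss through the predictor and context encoder only, and apply the EMA update after each optimizer step.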
Key results & what's novel
The key contribution is a transferable chest-radiograph encoder learned without dense labels. The significance is domain fit: the augmentation-free, latent-target recipe is well matched to medical imaging precisely because natural-image augmentations are unsafe here, and pixel reconstruction wastes capacity on texture. By pretraining on abundant unlabelled radiographs, RadJEPA reduces the annotation burden that constrains imaging-encoder development for chest X-ray.
Strengths & limitations
- + Augmentation-free recipe avoids destroying clinically meaningful X-ray signal.
- + Latent block targets concentrate capacity on anatomy/pathology, not pixel texture.
- + Learns from abundant unlabelled radiographs, reducing the annotation burden.
- − Block masking scale and count need tuning for radiographs.
- − Static encoder; no notion of dynamics or longitudinal change.
- − Downstream quality bounded by the diversity of the unlabelled pretraining set.