US-JEPA — World Modeling

At a glance

Problem: Medical ultrasound is dominated by speckle, a low-SNR, multiplicative interference pattern that pixel-reconstruction objectives wastefully try to model exactly, and labelled ultrasound is scarce.

Key idea: Masked latent prediction with a frozen domain teacher. The model predicts the latent embeddings of masked blocks, so it encodes anatomy and ignores irreducible speckle noise.

Modality: Ultrasound

Target / masking: Mask spatial image blocks; a frozen domain-pretrained teacher (not an EMA branch) supplies the targets.

Builds on: I-JEPA's masked latent block prediction.

Used for: A speckle-robust self-supervised ultrasound encoder.

Motivation

Medical ultrasound is dominated by speckle — a low-signal-to-noise, multiplicative interference pattern. Pixel-reconstruction objectives try (wastefully) to model speckle exactly, even though it is irreducibly unpredictable. Labelled ultrasound is also scarce. US-JEPA (Radhachandran et al., 2026) aims for a self-supervised encoder that captures diagnostic anatomy while ignoring the speckle noise, on the premise that latent prediction is especially advantageous when the signal carries large irreducible noise.

How it works

Canonical JEPA schematic for ultrasound. The input is split into a visible context and hidden targets (image block-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (here a frozen domain-pretrained teacher with gradients stopped, written $T$ in the objective) embeds the targets; the predictor $g_\phi$ maps the context embeddings to the target embeddings; training minimises the latent distance.

US-JEPA uses masked latent prediction.

A context encoder embeds visible image regions.
A predictor predicts the latent embeddings of masked blocks.
The targets are supplied by a frozen domain teacher — a domain-pretrained target encoder — rather than by an EMA branch tracking the student.

The masking unit is spatial image blocks. Using a frozen domain teacher is the distinguishing design choice: it provides stable, semantically meaningful targets throughout training, instead of the moving EMA target most JEPAs use. Because the speckle noise is unpredictable, latent targets force the model to encode structure, not particular speckle realisations.
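The data flow described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation. The grid size, block size, number of blocks, and the helper `sample_block_mask` are all hypothetical. Random linear maps stand in for the context encoder, the frozen teacher, and the predictor, and mean pooling of the context is a crude placeholder for whatever predictor architecture the paper uses.

```python
import numpy as np

def sample_block_mask(grid_h, grid_w, num_blocks, block_h, block_w, rng):
    """Boolean (grid_h, grid_w) mask; True marks a patch inside a masked target block."""
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    for _ in range(num_blocks):
        top = rng.integers(0, grid_h - block_h + 1)    # upper bound is exclusive
        left = rng.integers(0, grid_w - block_w + 1)
        mask[top:top + block_h, left:left + block_w] = True  # blocks may overlap
    return mask

rng = np.random.default_rng(0)
grid_h = grid_w = 8            # hypothetical 8x8 patch grid
patch_dim, embed_dim = 16, 4   # hypothetical sizes

mask = sample_block_mask(grid_h, grid_w, num_blocks=3, block_h=2, block_w=2, rng=rng)
mask_idx = np.flatnonzero(mask)    # target patches, hidden from the student
ctx_idx = np.flatnonzero(~mask)    # context patches, visible to the student

patches = rng.normal(size=(grid_h * grid_w, patch_dim))  # stand-in for flattened image patches

# Random linear maps stand in for the real networks (assumption for illustration).
W_student = rng.normal(size=(patch_dim, embed_dim))   # context encoder: trainable
W_teacher = rng.normal(size=(patch_dim, embed_dim))   # domain teacher: frozen, never updated
W_pred = rng.normal(size=(2 * embed_dim, embed_dim))  # predictor: trainable
pos = rng.normal(size=(grid_h * grid_w, embed_dim))   # one position token per patch

z_ctx = patches[ctx_idx] @ W_student                  # embed visible patches only
ctx_summary = np.tile(z_ctx.mean(axis=0), (mask_idx.size, 1))
pred = np.concatenate([ctx_summary, pos[mask_idx]], axis=1) @ W_pred  # one prediction per masked patch
targets = (patches @ W_teacher)[mask_idx]             # teacher embeds the full image; keep masked positions
```

Only `W_student` and `W_pred` would receive gradient updates in training. `W_teacher` stays fixed, which is what separates this setup from an EMA target that tracks the student.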

The objective

The loss is the latent distance between predicted embeddings and the frozen teacher's embeddings of the masked blocks:

$$\mathcal{L} = \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\,T(x)_k\,]\,\big\rVert_2^2,$$

where $T$ is the frozen domain teacher (held fixed, so no EMA update) and $\operatorname{sg}$ is stop-gradient. The frozen teacher gives a stationary target, which both stabilises training and supplies semantically meaningful structure for the student to predict.
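As a minimal sketch of the objective, the function below sums the squared L2 distance over masked positions. The name `latent_prediction_loss` and the array shapes are assumptions for illustration, and the toy inputs are made up. NumPy has no automatic differentiation, so the teacher output is a constant by construction; in an autodiff framework the stop-gradient would be applied where the comment indicates.

```python
import numpy as np

def latent_prediction_loss(pred, teacher_tokens, mask_idx):
    """Sum over masked positions k of ||pred_k - T(x)_k||_2^2.

    pred:           (K, D) predictor outputs g_phi(z_ctx, m_k), one row per masked position
    teacher_tokens: (N, D) frozen-teacher embeddings T(x) for all N positions
    mask_idx:       (K,)   integer indices of the masked positions
    """
    # The targets are constants here; with autodiff, stop-gradient would wrap this line.
    targets = teacher_tokens[mask_idx]
    return float(np.sum((pred - targets) ** 2))

teacher_tokens = np.arange(12, dtype=float).reshape(4, 3)  # toy T(x): 4 positions, D = 3
mask_idx = np.array([1, 3])

perfect = latent_prediction_loss(teacher_tokens[mask_idx], teacher_tokens, mask_idx)
off_by_one = latent_prediction_loss(teacher_tokens[mask_idx] + 1.0, teacher_tokens, mask_idx)
print(perfect, off_by_one)  # 0.0 6.0
```

The second value is 6.0 because two masked positions with three dimensions each are all off by exactly 1.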

Key results & what's novel

The explicit finding is that predicting in latent space handles low-SNR speckle better than pixel reconstruction: because the noise is unpredictable, latent targets force the model to encode anatomy rather than speckle realisations. The transferable lesson — and the broader claim for noisy biomedical signals — is that the JEPA principle is especially advantageous precisely when the signal contains large irreducible noise. The methodological novelty is the use of a frozen domain teacher for stable, semantically meaningful targets in place of the usual EMA branch.
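The claim about irreducible noise can be made concrete with a toy simulation. The sketch below assumes a common textbook model that the source does not state: observed intensity equals the underlying echogenicity times unit-mean, exponentially distributed speckle. The `anatomy` values are hypothetical. Under that assumption, even an oracle that knows the anatomy exactly has a pixel-space error floor of about `anatomy**2`, so no pixel-reconstruction model can drive its loss below that level.

```python
import numpy as np

rng = np.random.default_rng(0)

anatomy = np.array([0.5, 1.0, 2.0])   # hypothetical noise-free echogenicity at three pixels
n_draws = 200_000

# Assumed model: multiplicative speckle with mean 1 and variance 1 (exponential).
speckle = rng.exponential(scale=1.0, size=(n_draws, anatomy.size))
pixels = anatomy * speckle            # observed intensities

# Oracle pixel predictor: the conditional mean given the anatomy, which is the anatomy itself.
oracle_mse = np.mean((pixels - anatomy) ** 2, axis=0)

# Floor implied by the model: anatomy**2 * Var(speckle) = anatomy**2.
expected_floor = anatomy ** 2
```

In this toy model the remaining pixel error is pure speckle realisation, which is the part of the signal a latent target is meant to leave out.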

Strengths & limitations

+ Latent prediction handles low-SNR speckle better than pixel reconstruction.
+ Frozen domain teacher gives stable, semantically meaningful targets.
+ Augmentation-free; well matched to noisy ultrasound.
− Requires a suitable domain-pretrained teacher to freeze.
− A frozen target cannot co-adapt with the student as an EMA target would.
− Static encoder; learns representations, not dynamics.

Connections & references

Builds on: I-JEPA

Paper ↗