Motivation
Medical ultrasound is dominated by speckle, a multiplicative interference pattern that leaves images with a low signal-to-noise ratio. Pixel-reconstruction objectives waste capacity trying to model speckle exactly, even though each speckle realisation is irreducibly unpredictable. Labelled ultrasound data are also scarce. US-JEPA (Radhachandran et al., 2026) aims for a self-supervised encoder that captures diagnostic anatomy while ignoring speckle, on the premise that latent prediction is especially advantageous when the signal carries large irreducible noise.
How it works
US-JEPA uses masked latent prediction.
- A context encoder embeds visible image regions.
- A predictor predicts the latent embeddings of masked blocks.
- The targets are supplied by a frozen domain teacher — a domain-pretrained target encoder — rather than by an EMA branch tracking the student.
Masking operates on spatial image blocks. The frozen domain teacher is the distinguishing design choice: its targets stay fixed and semantically meaningful throughout training, whereas the EMA target used by most JEPAs moves as the student changes. Because speckle is unpredictable, a predictor can only match latent targets by encoding structure, not particular speckle realisations.
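The three steps above can be sketched with toy components. This is a minimal illustration, not the paper's architecture: the linear "encoders", the mean-pooling predictor, the per-position queries, and all sizes (an 8x8 patch grid, one 3x3 masked block) are assumptions made only so the data flow is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_block_mask(grid_h, grid_w, block_h, block_w, rng):
    """Return a boolean (grid_h, grid_w) mask with one rectangular block set to True."""
    top = rng.integers(0, grid_h - block_h + 1)
    left = rng.integers(0, grid_w - block_w + 1)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[top:top + block_h, left:left + block_w] = True
    return mask

def linear_encoder(patches, weights):
    """Toy stand-in for an encoder: one linear map applied to every patch."""
    return patches @ weights

# Hypothetical sizes, chosen only for illustration.
grid_h, grid_w, patch_dim, embed_dim = 8, 8, 16, 4
patches = rng.normal(size=(grid_h * grid_w, patch_dim))  # flattened patch grid

# One masked spatial block, flattened to match the patch ordering.
mask = sample_block_mask(grid_h, grid_w, 3, 3, rng).reshape(-1)

w_student = rng.normal(size=(patch_dim, embed_dim))  # trainable context encoder
w_teacher = rng.normal(size=(patch_dim, embed_dim))  # frozen teacher: never updated

# 1. The context encoder embeds only the visible patches.
z_ctx = linear_encoder(patches[~mask], w_student)

# 2. The frozen teacher embeds the full image; targets are its masked-block embeddings.
targets = linear_encoder(patches, w_teacher)[mask]

# 3. Toy predictor: pooled context plus a positional query per masked patch.
pos_queries = rng.normal(size=(grid_h * grid_w, embed_dim))
w_pred = rng.normal(size=(embed_dim, embed_dim))
predictions = (z_ctx.mean(axis=0) + pos_queries[mask]) @ w_pred
```

Note that the teacher sees the whole image while the student sees only the visible part, so the predictor has to infer the masked content from context.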
The objective
The loss is the latent distance between predicted embeddings and the frozen teacher's embeddings of the masked blocks:
$$\mathcal{L} = \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\,T(x)_k\,]\,\big\rVert_2^2,$$
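where $g_\phi$ is the predictor, $z_{\text{ctx}}$ is the context encoder's embedding of the visible regions, $m_k$ specifies which masked block $k$ is being predicted, $T$ is the frozen domain teacher (held fixed, so no EMA update), and $\operatorname{sg}$ is stop-gradient. Since $T$ receives no updates, the stop-gradient is formally redundant; it makes explicit that no gradient reaches the target. The frozen teacher gives a stationary target, which both stabilises training and supplies semantically meaningful structure for the student to predict.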
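As a worked instance of the objective, the following sketch evaluates the summed squared L2 distance on two masked blocks with two-dimensional embeddings. The function names and the numbers are illustrative assumptions. The teacher targets enter as plain constants, which is how the frozen teacher and stop-gradient appear in practice.

```python
import numpy as np

def latent_prediction_loss(predictions, teacher_targets):
    """Sum over masked blocks of the squared L2 distance to the teacher embedding."""
    diff = predictions - teacher_targets
    return float(np.sum(diff ** 2))

def loss_grad_wrt_predictions(predictions, teacher_targets):
    """Gradient of the loss with respect to the predictions only.

    The targets are treated as constants (frozen teacher, stop-gradient),
    so no gradient is computed for them.
    """
    return 2.0 * (predictions - teacher_targets)

# Two masked blocks, embedding dimension 2 (made-up numbers).
predictions = np.array([[1.0, 2.0], [0.0, -1.0]])
teacher_targets = np.array([[1.0, 0.0], [1.0, -1.0]])

loss = latent_prediction_loss(predictions, teacher_targets)
grad = loss_grad_wrt_predictions(predictions, teacher_targets)
print(loss)  # 5.0: block 1 contributes 0 + 4, block 2 contributes 1 + 0
```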
Key results & what's novel
The explicit finding is that predicting in latent space handles low-SNR speckle better than pixel reconstruction: because the noise is unpredictable, latent targets force the model to encode anatomy rather than speckle realisations. The transferable lesson, and the broader claim for noisy biomedical signals, is that the JEPA principle pays off most when the signal contains large irreducible noise. The methodological novelty is the use of a frozen domain teacher for stable, semantically meaningful targets in place of the usual EMA branch.
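The difference between a frozen teacher and an EMA branch comes down to the target-update rule. The sketch below contrasts the two on toy parameter vectors; the momentum value, the fake "gradient step", and the function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def ema_update(target_params, student_params, momentum=0.99):
    """Usual JEPA-style target: an exponential moving average of the student."""
    return momentum * target_params + (1.0 - momentum) * student_params

def frozen_update(target_params, student_params):
    """Frozen teacher: the target ignores the student entirely."""
    return target_params

student = np.zeros(3)
ema_target = np.ones(3)
frozen_target = np.ones(3)

for step in range(100):
    student = student + 0.1  # stand-in for a gradient step on the student
    ema_target = ema_update(ema_target, student)
    frozen_target = frozen_update(frozen_target, student)

# The EMA target has drifted toward the student; the frozen target has not moved.
ema_drifted = not np.allclose(ema_target, 1.0)
frozen_unchanged = np.array_equal(frozen_target, np.ones(3))
```

The frozen rule is what makes the regression target stationary. The cost, as listed under limitations, is that the target can never co-adapt with the student.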
Strengths & limitations
- + Latent prediction handles low-SNR speckle better than pixel reconstruction.
- + Frozen domain teacher gives stable, semantically meaningful targets.
- + Augmentation-free; well matched to noisy ultrasound.
- − Requires a suitable domain-pretrained teacher to freeze.
- − A frozen target cannot co-adapt with the student as an EMA target would.
- − Static encoder; learns representations, not dynamics.