At a glance
Problem: Cardiac function assessment from echocardiography is operator-dependent, acoustically noisy and label-scarce, while raw-pixel modelling wastes capacity on speckle and probe variation.
Key idea: A latent predictive foundation model for echocardiography video: predict masked spatiotemporal regions in latent space, so the model infers cardiac dynamics rather than reconstructing pixels.
Modality: Echocardiography video
Target / masking: Mask spatiotemporal video patches; an EMA target encoder supplies stop-gradient targets.
Builds on: V-JEPA's spatiotemporal latent-prediction recipe.
Used for: LVEF / RVSP estimation, sample-efficient and acoustically robust cardiac assessment, pediatric zero-shot.

Motivation

Assessing cardiac function from echocardiography is hard: measurements are operator-dependent and acoustically noisy, and labels are scarce. Raw-pixel modelling spends capacity on speckle and probe variation rather than on the cardiac motion that matters. EchoJEPA (Munim et al., 2026) builds a foundation model for echo video, aiming for a robust, sample-efficient latent representation of cardiac motion that transfers across populations and acquisition conditions.

How it works

Figure: Canonical JEPA schematic for echocardiography video. The input is split into a visible context and hidden targets, both at the level of spatiotemporal patches. The context encoder $f_\theta$ embeds what is visible, giving $z_{\text{ctx}}$; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets, giving $\bar z$; the predictor $g_\phi$ maps the context to predicted target embeddings $\hat z$; training minimises the latent distance $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.

EchoJEPA applies the V-JEPA-style spatiotemporal recipe.

  • A context encoder embeds visible echo video tokens.
  • A predictor predicts the latent representations of masked spatiotemporal regions.
  • An EMA target encoder embeds the full clip and supplies stop-gradient targets at the masked positions for the latent prediction loss.
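The EMA relationship between the two encoders can be sketched in a few lines. This is a minimal illustration in plain Python, not EchoJEPA's code: the function name `ema_update`, the flat list-of-floats parameter layout, and the momentum values are all assumptions made for the example. It shows only the standard exponential-moving-average rule, in which the target weights drift slowly towards the context-encoder weights and never receive gradients themselves.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Return new target parameters: momentum * target + (1 - momentum) * online.

    momentum=0.99 is a placeholder value, not a figure from the paper.
    """
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]


# Toy example: two scalar "weights" per encoder (hypothetical values).
online = [1.0, -2.0]   # context encoder weights after a gradient step
target = [0.0, 0.0]    # target encoder weights, updated only by EMA

target = ema_update(target, online, momentum=0.9)
print(target)          # each target weight has moved 10% of the way to the online weight
```

A momentum close to 1 keeps the targets stable from step to step, which is the usual reason for pairing an EMA target encoder with a stop-gradient.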

The masking unit is spatiotemporal video patches, so the model infers cardiac dynamics in representation space rather than reconstructing pixels. It is trained at scale on 18M echocardiograms across 300K patients, which is what lets the latent representation generalise across acquisition conditions and populations.
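To make the masking unit concrete, the sketch below splits a token grid into context and masked positions. It is a simplified illustration under stated assumptions: the clip size (16 frames of 224 x 224 pixels), the tubelet size (2 x 16 x 16), the single 10 x 10 hidden block, and the helper name `tube_mask` are hypothetical, and the source does not specify EchoJEPA's actual masking geometry. One spatial block is hidden at every temporal position, so the hidden region forms a spatiotemporal tube.

```python
import random


def tube_mask(t_tokens, h_tokens, w_tokens, block_h, block_w, seed=0):
    """Return (context_ids, masked_ids) for a T x H x W token grid.

    One spatial block of size block_h x block_w is hidden at every
    temporal position, so the hidden region is a spatiotemporal tube.
    """
    rng = random.Random(seed)
    top = rng.randint(0, h_tokens - block_h)    # randint bounds are inclusive
    left = rng.randint(0, w_tokens - block_w)

    context_ids, masked_ids = [], []
    for t in range(t_tokens):
        for r in range(h_tokens):
            for c in range(w_tokens):
                token_id = (t * h_tokens + r) * w_tokens + c
                inside = top <= r < top + block_h and left <= c < left + block_w
                (masked_ids if inside else context_ids).append(token_id)
    return context_ids, masked_ids


# Hypothetical clip: 16 frames of 224 x 224 pixels, tubelets of 2 x 16 x 16.
T, H, W = 16 // 2, 224 // 16, 224 // 16      # 8 x 14 x 14 token grid
ctx, msk = tube_mask(T, H, W, block_h=10, block_w=10)
print(len(ctx), len(msk))                    # 768 800
```

Only the `ctx` tokens would be fed to the context encoder; the `msk` indices tell the predictor which positions to fill in.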

The objective

The loss is the latent distance over masked spatiotemporal echo patches:

$$\mathcal{L} = \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\bar f_\theta(x)_k]\,\big\rVert_2^2,$$

with predictor $g_\phi$, stop-gradient $\operatorname{sg}$, and EMA target encoder $\bar f_\theta$. Predicting latent targets, not pixels, is what keeps the model from spending capacity on speckle and probe-specific texture and instead encodes cardiac motion.
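As a worked instance of the formula, the sketch below evaluates the sum of squared L2 distances over masked positions. It is plain Python with made-up two-dimensional latents; the function name `latent_loss` and the dict-of-lists layout are assumptions for illustration. The stop-gradient has no numerical effect on the loss value, so here the targets are simply treated as constants; in an autodiff framework one would detach them, for example with `jax.lax.stop_gradient` or `Tensor.detach()`.

```python
def latent_loss(predicted, target, masked_ids):
    """Sum of squared L2 distances over masked token positions.

    predicted: dict token_id -> predicted latent vector (list of floats)
    target:    dict token_id -> target latent vector from the EMA encoder,
               treated as a constant (the stop-gradient in the formula)
    """
    total = 0.0
    for k in masked_ids:
        diff = [p - z for p, z in zip(predicted[k], target[k])]
        total += sum(d * d for d in diff)
    return total


# Toy 2-D latents for two masked tokens (values are made up).
pred = {3: [1.0, 2.0], 7: [0.0, 0.0]}
targ = {3: [1.0, 0.0], 7: [3.0, 4.0]}
print(latent_loss(pred, targ, masked_ids=[3, 7]))  # 4.0 + 25.0 = 29.0
```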

Key results & what's novel

EchoJEPA reports improved estimation of LVEF (left-ventricular ejection fraction) and RVSP (right-ventricular systolic pressure), better sample efficiency, robustness to acoustic variation, and strong pediatric zero-shot generalisation. The pediatric zero-shot result is notable: a representation trained largely on adult data transfers to children without retraining, which suggests the latent space captures cardiac motion rather than population-specific appearance. The novelty lies in scale and modality fit: a V-JEPA-style model trained on 18M echocardiograms across 300K patients as a cardiac-video foundation model.

Strengths & limitations

  • + Trained at large scale (18M echos, 300K patients); improves LVEF and RVSP estimation.
  • + Sample-efficient and robust to acoustic variation; strong pediatric zero-shot transfer.
  • + Latent targets avoid wasting capacity on speckle and probe variation.
  • - The large-scale training corpus is a barrier to reproduction.
  • - Spatiotemporal masking design for echo needs tuning.
  • - A representation model of cardiac motion, not an action-conditioned world model.

Connections & references

Builds on: V-JEPA 2