At a glance
Problem: Unlabelled ECG recordings are abundant but expert-labelled diagnoses are scarce, and the clinically meaningful signal is morphology and rhythm structure, not exact sample-level waveform values.
Key idea: Pretrain ECG representations by predicting the latent embeddings of masked temporal segments, so the model encodes waveform structure rather than reconstructing raw samples.
Modality: ECG
Target / masking: Mask temporal segments across leads; an EMA-style target encoder supplies stop-gradient targets.
Builds on: I-JEPA / V-JEPA latent-prediction recipe applied to physiological time series.
Used for: Downstream ECG diagnostic classification.

Motivation

ECG interpretation has a characteristic data shape: large volumes of unlabelled recordings exist, but expert-labelled, diagnosis-annotated data are limited. The clinically meaningful signal is the morphology and rhythm structure of the trace, not the exact sample-level waveform values, which contain irreducible high-frequency noise. ECG-JEPA (Weimann et al., 2024) seeks a self-supervised pretraining objective whose representations transfer to downstream diagnostic classification, exploiting the abundant unlabelled signal.

How it works

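[Figure: ECG (temporal segments across leads) → context encoder $f_\theta$ producing $z_{\text{ctx}}$, and target encoder $\bar f_\theta$ (EMA copy) producing $\bar z$ → predictor $g_\phi$ → latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$]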
Canonical JEPA schematic for ECG. The input is split into a visible context and hidden targets, both made up of temporal segments taken across leads. The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy of the context encoder, with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between predicted and target embeddings in latent space.

ECG-JEPA applies the joint-embedding predictive recipe to electrocardiogram signals.

  • A context encoder embeds visible portions of the ECG.
  • A predictor predicts the latent representations of masked segments.
  • An EMA-style target encoder supplies the stop-gradient targets for the latent prediction loss.

The masking unit is temporal segments (across leads), so the model infers the latent structure of missing waveform regions rather than reconstructing raw samples. Because the target is a representation, not a sample, the objective sidesteps the irreducible high-frequency noise of the raw trace and concentrates capacity on rhythm and morphology.
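To make the masking unit concrete, here is a minimal NumPy sketch of splitting a multi-lead recording into temporal segments and hiding some of them across all leads at once. The function names `segment_ecg` and `sample_segment_mask`, the 12-lead 500 Hz input, the segment length of 250 samples, and the 50% mask ratio are all illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def segment_ecg(ecg, seg_len):
    """Split a (leads, samples) array into non-overlapping temporal segments.

    Returns shape (num_segments, leads, seg_len). Trailing samples that do
    not fill a whole segment are dropped.
    """
    leads, samples = ecg.shape
    num_segments = samples // seg_len
    trimmed = ecg[:, : num_segments * seg_len]
    return trimmed.reshape(leads, num_segments, seg_len).transpose(1, 0, 2)

def sample_segment_mask(num_segments, mask_ratio, rng):
    """Boolean mask over segments: True = hidden target, False = visible context."""
    if num_segments < 2:
        raise ValueError("need at least two segments to form context and targets")
    num_masked = int(round(mask_ratio * num_segments))
    # Keep at least one target and at least one context segment.
    num_masked = min(max(num_masked, 1), num_segments - 1)
    masked_idx = rng.choice(num_segments, size=num_masked, replace=False)
    mask = np.zeros(num_segments, dtype=bool)
    mask[masked_idx] = True
    return mask

rng = np.random.default_rng(0)
ecg = rng.standard_normal((12, 5000))      # hypothetical 12-lead, 10 s at 500 Hz
segments = segment_ecg(ecg, seg_len=250)   # shape (20, 12, 250)
mask = sample_segment_mask(len(segments), mask_ratio=0.5, rng=rng)
context, targets = segments[~mask], segments[mask]
```

Because the mask indexes whole time segments, every lead is hidden over the same interval, so the model cannot simply copy the missing region from a neighbouring lead. Other designs (per-lead masks, contiguous blocks) are possible; this sketch shows only the simplest one.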

The objective

The loss is the latent distance over masked temporal ECG segments:

$$\mathcal{L} = \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\bar f_\theta(x)_k]\,\big\rVert_2^2,$$

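where $z_{\text{ctx}}$ is the context encoder's embedding of the visible segments, $m_k$ identifies the masked segment $k$ to be predicted (for example a positional mask token, as in I-JEPA), $g_\phi$ is the predictor, $\operatorname{sg}$ is the stop-gradient operator, and $\bar f_\theta$ is the EMA target encoder. No augmentations or contrastive negatives are required; predicting latent targets is what avoids fitting sample-level noise.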
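The two ingredients of this objective, the summed squared latent distance and the EMA update of the target encoder, can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: `latent_loss` and `ema_update` are hypothetical helper names, parameters are stored in a plain dict, and the decay of 0.9 is chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def latent_loss(pred, target):
    """Sum over masked segments of the squared L2 distance in latent space.

    pred, target: arrays of shape (num_masked, dim). The target is treated
    as a constant; in an autodiff framework this is where a stop-gradient
    (e.g. jax.lax.stop_gradient or tensor.detach()) would be applied.
    """
    return float(np.sum((pred - target) ** 2))

def ema_update(target_params, online_params, decay):
    """Move each target-encoder parameter toward the context encoder's value."""
    return {
        name: decay * target_params[name] + (1.0 - decay) * online_params[name]
        for name in target_params
    }

# Two masked segments with 2-dimensional embeddings (toy values).
pred = np.array([[1.0, 2.0], [0.0, 0.0]])
target = np.array([[1.0, 0.0], [3.0, 4.0]])
loss = latent_loss(pred, target)   # (0 + 4) + (9 + 16) = 29.0

# Hypothetical decay; practical values are typically much closer to 1.
new_target = ema_update(
    {"w": np.array([0.0, 0.0])},   # target encoder parameters
    {"w": np.array([1.0, 2.0])},   # context encoder parameters
    decay=0.9,
)                                  # {"w": array([0.1, 0.2])}
```

Only the context encoder and predictor receive gradients from the loss; the target encoder changes solely through the EMA update, which is what keeps the targets slowly moving and helps avoid representational collapse.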

Key results & what's novel

The key result is that JEPA pretraining improves downstream ECG classification relative to training without such self-supervised representations. This confirms that the latent-prediction principle, developed for vision, transfers to physiological time series: the same recipe of masked latent prediction with an EMA target works on 1D multi-lead biosignals. The practical value is that it exploits abundant unlabelled ECGs to improve classifier performance exactly where annotated, diagnosis-relevant events are rare.

Strengths & limitations

  • + Improves downstream ECG classification by exploiting abundant unlabelled recordings.
  • + Latent targets avoid fitting irreducible sample-level waveform noise.
  • + Augmentation-free; demonstrates JEPA transfers to physiological time series.
  • − Demonstrated as a representation/pretraining gain, not a dynamics model.
  • − Segment masking design for multi-lead ECG needs tuning.
  • − Downstream quality still bounded by the diversity of the pretraining ECGs.

Connections & references

Builds on: I-JEPA