Multimodal JEPA for Imaging and Clinical Signatures

At a glance

Problem: Medical imaging and structured clinical variables are usually modelled separately, missing cross-modal structure, while supervised fusion needs large aligned labelled cohorts that rarely exist.

Key idea: A multimodal JEPA that predicts, in latent space, the representation of one modality (or a masked component) from another, learning joint imaging-and-clinical signatures.

Modality: Imaging + clinical

Target / masking: Mask across modalities/components; a target encoder supplies stop-gradient latent targets.

Builds on: I-JEPA's latent-prediction recipe, extended across modalities.

Used for: Fused imaging-clinical representations from incompletely labelled cohorts.

Motivation

Patient data is inherently multimodal: medical imaging sits alongside structured clinical variables. The two are typically modelled separately, which misses the cross-modal structure linking an imaging phenotype to clinical state. The obvious alternative, supervised fusion, demands large, aligned, labelled cohorts that rarely exist. Li et al. (2025) target this gap with a self-supervised representation that integrates imaging and clinical data into a single space without requiring dense aligned labels.

How it works

Figure: Canonical JEPA schematic for imaging + clinical data. The input is split into a visible context and hidden targets (at the modality or component level, across modalities). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy of $f_\theta$ with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the latent distance between prediction and target.

The method applies a multimodal JEPA with three components:

- A context encoder embeds one or more modalities.
- A predictor predicts, in latent space, the representation of a target modality or masked component.
- A target encoder produces that target representation, with stop-gradient.

The masking / prediction unit spans modalities: the model learns to predict latent clinical signatures from imaging and vice versa, capturing joint imaging-and-clinical signatures without reconstructing raw pixels or tabular values. Because prediction is cross-modal and in latent space, the model is forced to encode how an imaging phenotype relates to clinical variables rather than treating each stream in isolation.
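The three components above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear encoders, the dimensions, the modality-specific projections `W_img` and `W_clin`, and all function and variable names are hypothetical. The target encoder here is a plain copy of the context-encoder weights, standing in for the EMA copy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_PATIENTS, IMG_DIM, CLIN_DIM, WIDTH = 4, 32, 8, 16

# Toy inputs: per-patient imaging features and structured clinical variables.
imaging_feats = rng.normal(size=(N_PATIENTS, IMG_DIM))
clinical_feats = rng.normal(size=(N_PATIENTS, CLIN_DIM))

# Modality-specific projections map each modality to a shared token width
# (an assumption: the source does not specify the tokenisation).
W_img = rng.normal(scale=0.1, size=(IMG_DIM, WIDTH))
W_clin = rng.normal(scale=0.1, size=(CLIN_DIM, WIDTH))

# Shared encoder weights (context) and a copy for the target encoder.
W_ctx = rng.normal(scale=0.1, size=(WIDTH, WIDTH))
W_tgt = W_ctx.copy()  # stands in for the EMA copy; it receives no gradients
W_pred = rng.normal(scale=0.1, size=(WIDTH, WIDTH))


def context_encoder(tokens):
    """f_theta: embed the visible (context) modality."""
    return np.tanh(tokens @ W_ctx)


def target_encoder(tokens):
    """f_theta-bar: embed the held-out (target) modality."""
    return np.tanh(tokens @ W_tgt)


def predictor(z_ctx):
    """g_phi: map the context embedding to a predicted target embedding."""
    return z_ctx @ W_pred


# Imaging is the visible context; clinical variables are the hidden target.
z_ctx = context_encoder(imaging_feats @ W_img)
z_pred = predictor(z_ctx)
z_tgt = target_encoder(clinical_feats @ W_clin)

# Prediction and target live in the same latent space, one row per patient.
print(z_pred.shape, z_tgt.shape)  # (4, 16) (4, 16)
```

Swapping the roles of `imaging_feats` and `clinical_feats` gives the reverse direction, predicting the imaging latent from clinical variables.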

The objective

The loss is the latent distance between the predicted and target representations of the held-out modality/component:

$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}) - \operatorname{sg}[\bar f_\theta(x_{\text{tgt}})]\,\big\rVert_2^2,$$

with context embedding $z_{\text{ctx}} = f_\theta(x_{\text{ctx}})$, predictor $g_\phi$, stop-gradient $\operatorname{sg}$, and target encoder $\bar f_\theta$. Predicting representations rather than raw pixels or tabular values is what lets the model learn cross-modal structure from incompletely labelled, real-world cohorts.
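To make the objective concrete, the sketch below evaluates the squared latent distance on a toy prediction/target pair and then applies one EMA step to target-encoder weights. The numbers, the names `latent_loss` and `ema_update`, the batch averaging, and the momentum value 0.99 are assumptions for illustration; the source states only that the target encoder is an EMA copy with gradients stopped. NumPy has no automatic differentiation, so the stop-gradient is implicit: the target is simply treated as a constant.

```python
import numpy as np


def latent_loss(z_pred, z_tgt):
    """Squared L2 distance between predicted and target embeddings,
    averaged over the batch. z_tgt is treated as a constant (stop-gradient)."""
    return float(np.mean(np.sum((z_pred - z_tgt) ** 2, axis=-1)))


def ema_update(target_w, context_w, momentum=0.99):
    """Move target-encoder weights a small step towards the context encoder.
    The momentum value is a hypothetical choice, not taken from the source."""
    return momentum * target_w + (1.0 - momentum) * context_w


# Two patients, 3-dimensional latents (toy values).
z_pred = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
z_tgt = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 3.0]])

# Per-patient squared distances are 5 and 4, so the batch mean is 4.5.
loss = latent_loss(z_pred, z_tgt)
print(loss)  # 4.5

# One EMA step: the target weights drift 1% of the way to the context weights.
target_w = np.zeros((2, 2))
context_w = np.ones((2, 2))
new_target_w = ema_update(target_w, context_w)
```

In a framework with automatic differentiation, only the context encoder and predictor would receive gradients from this loss; the target encoder would change solely through the EMA step.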

Key results & what's novel

The key idea is cross-modal latent prediction as a route to fused representations that encode how imaging phenotypes relate to clinical state. The novelty lies in extending the joint-embedding predictive principle across modalities rather than within a single one: instead of contrastively aligning image-text pairs, the model predicts one modality's latent representation from another's. Because this structure is learned without labels, the model can draw on incompletely labelled cohorts and does not need large aligned labelled datasets to link imaging phenotypes to clinical state.

Strengths & limitations

+ Learns cross-modal imaging-clinical structure self-supervised, without large aligned labelled cohorts.
+ Latent prediction avoids reconstructing raw pixels or tabular values.
+ Uses incompletely labelled real-world cohorts that supervised fusion cannot.
− Heterogeneous imaging and tabular modalities are hard to tokenise and align.
− Quality depends on how well the modalities are paired per patient.
− Produces a fused representation, not a causal or dynamics model.

Connections & references

Builds on: I-JEPA

Related: EchoJEPA, US-JEPA, Brain-JEPA
