Motivation
Patient data is inherently multimodal: medical imaging alongside structured clinical variables. The two are typically modelled separately, which misses the cross-modal structure linking an imaging phenotype to clinical state. The obvious alternative, supervised fusion, demands large, aligned, labelled cohorts that rarely exist. Li et al. (2025) target this gap: a self-supervised representation that integrates imaging and clinical data into a single space without requiring dense aligned labels.
How it works
The method applies a multimodal joint-embedding predictive architecture (JEPA) with three components:
- A context encoder embeds one or more modalities.
- A predictor predicts, in latent space, the representation of a target modality or masked component.
- A target encoder produces that target representation, with stop-gradient.
The masking / prediction unit spans modalities: the model learns to predict latent clinical signatures from imaging and vice versa, capturing joint imaging-and-clinical signatures without reconstructing raw pixels or tabular values. Because prediction is cross-modal and in latent space, the model is forced to encode how an imaging phenotype relates to clinical variables rather than treating each stream in isolation.
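The three components above can be sketched as a single forward pass. This is a minimal illustration, not the authors' implementation: the single-layer encoders, the feature sizes, and every name in it (`init_linear`, `predict_clinical_latent`, `target_clinical_latent`, `params`) are assumptions made for the example, and it shows only the imaging-to-clinical direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: flattened imaging features, clinical variables, latent width.
D_IMG, D_CLIN, D_LATENT = 64, 12, 16


def init_linear(d_in, d_out):
    """Random weights for one linear layer (a stand-in for a real encoder)."""
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))


params = {
    "ctx_img": init_linear(D_IMG, D_LATENT),        # context encoder (imaging)
    "tgt_clin": init_linear(D_CLIN, D_LATENT),      # target encoder (clinical)
    "predictor": init_linear(D_LATENT, D_LATENT),   # latent-space predictor
}


def predict_clinical_latent(params, x_img):
    """Context encoder followed by the predictor: imaging -> predicted clinical latent."""
    z_ctx = np.tanh(x_img @ params["ctx_img"])
    return z_ctx @ params["predictor"]


def target_clinical_latent(params, x_clin):
    """Target encoder: clinical variables -> target latent.

    In training this output is treated as a constant (stop-gradient).
    """
    return np.tanh(x_clin @ params["tgt_clin"])


# Synthetic batch of 8 patients with paired imaging features and clinical variables.
x_img = rng.normal(size=(8, D_IMG))
x_clin = rng.normal(size=(8, D_CLIN))

z_pred = predict_clinical_latent(params, x_img)
z_tgt = target_clinical_latent(params, x_clin)
print(z_pred.shape, z_tgt.shape)
```

Both outputs live in the same latent space, so they can be compared directly without ever reconstructing pixels or tabular values. The reverse direction (clinical to imaging) would swap the roles of the two modalities.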
The objective
The loss is the latent distance between the predicted and target representations of the held-out modality/component:
$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}) - \operatorname{sg}[\bar f_\theta(x_{\text{tgt}})]\,\big\rVert_2^2,$$
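where $z_{\text{ctx}}$ is the context encoder's embedding of the visible input, $x_{\text{tgt}}$ is the held-out modality or component, $g_\phi$ is the predictor, $\operatorname{sg}$ is the stop-gradient operator, and $\bar f_\theta$ is the target encoder. Predicting representations rather than raw pixels or tabular values is what lets the model learn cross-modal structure from incompletely labelled, real-world cohorts.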
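The loss above can be illustrated numerically. The sketch below assumes a linear predictor, a mean over the batch, and random stand-in latents; `jepa_loss` and `predictor_grad` are hypothetical helper names, not part of the paper. NumPy has no automatic differentiation, so the stop-gradient is expressed by treating the target latents as a constant when the gradient is written out by hand. An autodiff framework would use its own stop-gradient operation instead.

```python
import numpy as np


def jepa_loss(z_pred, z_tgt):
    """Batch mean of the squared L2 distance between predicted and target latents."""
    return float(np.mean(np.sum((z_pred - z_tgt) ** 2, axis=-1)))


def predictor_grad(z_ctx, w_pred, z_tgt):
    """Gradient of jepa_loss w.r.t. the weights of a linear predictor.

    z_tgt is held constant, which is the effect of the stop-gradient:
    no gradient term is computed for the target encoder.
    """
    residual = z_ctx @ w_pred - z_tgt
    return 2.0 * z_ctx.T @ residual / z_ctx.shape[0]


rng = np.random.default_rng(0)
z_ctx = rng.normal(size=(8, 16))    # stand-in context embeddings
z_tgt = rng.normal(size=(8, 16))    # stand-in target latents (constants)
w_pred = rng.normal(scale=0.25, size=(16, 16))

loss_before = jepa_loss(z_ctx @ w_pred, z_tgt)
w_pred = w_pred - 0.01 * predictor_grad(z_ctx, w_pred, z_tgt)  # one gradient step
loss_after = jepa_loss(z_ctx @ w_pred, z_tgt)
print(loss_before, loss_after)
```

One small gradient step on the predictor lowers the loss while the targets stay fixed. In a full model the same gradient would also flow back into the context encoder, but never into the target encoder.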
Key results & what's novel
The key idea is cross-modal latent prediction as a route to fused representations that encode how imaging phenotypes relate to clinical state. The novelty is extending the joint-embedding predictive principle across modalities rather than within a single one: instead of contrastive alignment of image-text pairs, the model predicts one modality's latent representation from another's. Because this structure is learned through self-supervision, the model can use incompletely labelled cohorts, improving the link between imaging phenotypes and clinical state without large aligned labelled datasets.
Strengths & limitations
- + Learns cross-modal imaging-clinical structure through self-supervision, without large aligned labelled cohorts.
- + Latent prediction avoids reconstructing raw pixels or tabular values.
- + Uses incompletely labelled real-world cohorts that supervised fusion cannot.
- − Heterogeneous imaging and tabular modalities are hard to tokenise and align.
- − Quality depends on how well the modalities are paired per patient.
- − Produces a fused representation, not a causal or dynamics model.