Motivation
A musical mix is intrinsically compositional: it decomposes into stems — drums, bass, vocals, harmony — that must be musically and acoustically compatible to sound right together. Tasks like finding an accompaniment for a vocal line, retrieving a bass that fits a groove, or generating a complementary track all require a model of cross-stem fit, not merely good per-clip embeddings. Standard audio SSL learns within-signal structure but says nothing about whether two simultaneously sounding sources belong together. Stem-JEPA targets exactly this relational notion of compatibility.
How it works
Stem-JEPA reframes compatibility as a masked-prediction task in which the masked unit is a whole stem.
- A context encoder $f_\theta$ embeds a mix of some subset of the stems.
- A predictor $g_\phi$, conditioned on the type of the held-out (target) stem, predicts the latent representation of that missing source.
- A target encoder $\bar f_\theta$, whose weights are an exponential moving average (EMA) of the context encoder's weights, independently embeds the true held-out stem to provide the prediction target.
Because one source is removed and reconstructed only in embedding space from the remaining mix, the model must learn the relational structure among instruments that play together. Distance in the resulting space encodes compatibility, and the type conditioning enables type-aware retrieval — e.g. "find a drum stem for this mix".
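The data flow through the three components can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the encoders and predictor are random linear maps standing in for trained networks, the clips are toy random waveforms, the context is the sum of all remaining stems (the method allows any subset), and type conditioning is done by concatenating a one-hot vector, which is one possible choice and not necessarily the one used in Stem-JEPA. All names (`make_linear`, `apply_linear`, `mix`, `one_hot`) are hypothetical.

```python
import random

random.seed(0)

DIM = 4        # toy embedding size (assumption)
N_SAMPLES = 8  # toy clip length in samples (assumption)
STEM_TYPES = ["drums", "bass", "vocals", "harmony"]

def make_linear(n_in, n_out):
    """Random weights standing in for a trained network (assumption)."""
    return [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, x):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mix(stems):
    """Sum equal-length waveforms sample by sample."""
    return [sum(samples) for samples in zip(*stems)]

def one_hot(stem_type):
    """One-hot encoding of the stem type (hypothetical conditioning scheme)."""
    return [1.0 if t == stem_type else 0.0 for t in STEM_TYPES]

# Toy stems: one random waveform per instrument type.
stems = {t: [random.uniform(-1.0, 1.0) for _ in range(N_SAMPLES)]
         for t in STEM_TYPES}

context_weights = make_linear(N_SAMPLES, DIM)                # stands in for f_theta
target_weights = [row[:] for row in context_weights]         # target encoder starts as a copy
predictor_weights = make_linear(DIM + len(STEM_TYPES), DIM)  # stands in for g_phi

held_out = "drums"
context_mix = mix([wave for t, wave in stems.items() if t != held_out])

# Context encoder embeds the mix without the held-out stem.
context_emb = apply_linear(context_weights, context_mix)
# Predictor sees the context embedding plus the type of the missing stem.
predicted = apply_linear(predictor_weights, context_emb + one_hot(held_out))
# Target encoder embeds the held-out stem on its own.
target = apply_linear(target_weights, stems[held_out])
```

The predictor never sees the held-out waveform; it only receives the context embedding and the type label, so any agreement between `predicted` and `target` has to come from what the context reveals about the missing part.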
The objective
Let $x_s$ denote the held-out target stem, of type $\tau$, and let $x_{\setminus s}$ be the mix of the remaining stems. Training minimizes the squared latent distance between the predicted and true stem embeddings:
$$\mathcal{L} = \big\lVert\, g_\phi\big(f_\theta(x_{\setminus s}),\, \tau\big) - \operatorname{sg}\big[\bar f_\theta(x_s)\big]\,\big\rVert_2^2,$$
with $\operatorname{sg}$ the stop-gradient and $\bar f_\theta$ updated by EMA. The conditioning on the target type $\tau$ lets a single predictor handle any missing instrument, so the learned geometry captures which stems fit a given context rather than only how to embed an isolated clip.
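As a minimal sketch of the two pieces of this objective, the snippet below computes the squared L2 loss on plain lists of floats and applies one EMA step to a flat list of parameters. The function names and the decay value are assumptions chosen for illustration; the source does not state the decay schedule. Plain Python has no automatic differentiation, so the stop-gradient is only noted in a comment.

```python
def latent_loss(predicted, target):
    """Squared L2 distance between predicted and target embeddings.

    The target is treated as a constant. In an autodiff framework this is
    where a stop-gradient would be applied (for example
    jax.lax.stop_gradient in JAX or Tensor.detach() in PyTorch).
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

def ema_update(target_params, context_params, decay=0.99):
    """Move each target parameter a small step toward the context parameter.

    The default decay of 0.99 is an illustrative assumption.
    """
    return [decay * t + (1.0 - decay) * c
            for t, c in zip(target_params, context_params)]

# Toy values: the loss is 1.0**2 + 2.0**2.
loss = latent_loss([1.0, 2.0], [0.0, 0.0])

# One EMA step with a deliberately small decay so the movement is visible.
new_target = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9)
```

Only the context encoder and predictor would receive gradients from `latent_loss`; the target encoder changes solely through `ema_update`, which is what keeps the target slowly moving relative to the prediction.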
Key results & what's novel
Stem-JEPA generalizes JEPA's masked-prediction principle from within-signal patches to between-source relations: the held-out unit is an entire musical source, not a spectrogram block. The learned space acts as a compatibility metric, in which nearby embeddings correspond to stems that fit together, and this directly supports stem retrieval, accompaniment matching, and compatibility estimation. The novelty is showing that the latent-prediction objective adapts cleanly to a compositional, multi-source modality, where the structure to be modeled is the relationship among simultaneously sounding parts rather than the internal texture of one signal.
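Retrieval with such a space reduces to a nearest-neighbour search. The sketch below assumes a predicted embedding for the missing stem is already available, along with precomputed embeddings of candidate stems of the requested type. It ranks candidates by Euclidean distance, which is an assumption in line with the L2 training loss; the candidate names and vectors are made up for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_candidates(predicted, candidates):
    """Return candidate names sorted from closest to farthest."""
    return sorted(candidates,
                  key=lambda name: l2_distance(predicted, candidates[name]))

# Hypothetical predicted embedding for a missing drum stem.
predicted = [1.0, 0.0]

# Hypothetical target-encoder embeddings of drum stems in a library.
candidates = {
    "drums_a": [0.9, 0.1],
    "drums_b": [-1.0, 0.5],
    "drums_c": [0.5, 0.5],
}

ranking = rank_candidates(predicted, candidates)
```

Here `drums_a` lies closest to the prediction and `drums_b` farthest, so the ranking comes out as `drums_a`, `drums_c`, `drums_b`. In a real system the library would be restricted to stems of the requested type before ranking.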
Strengths & limitations
- + Learns a directly usable compatibility metric for music retrieval and accompaniment.
- + Type-conditioned predictor handles arbitrary missing instruments with one model.
- + Extends JEPA to relational, multi-source structure rather than within-signal patches.
- − Requires stem-separated (multitrack) training data, which is scarcer than mixes.
- − Compatibility as latent distance is an approximation; musical "fit" is partly subjective and context-dependent.
- − Predicting an expected stem embedding cannot capture the multimodality of equally valid accompaniments.
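The last limitation follows from the form of the loss. As a short derivation, treat the target encoder as fixed and write $c = f_\theta(x_{\setminus s})$ for the context embedding and $z = \bar f_\theta(x_s)$ for the target embedding (shorthand introduced here for convenience). For a given context and type, the prediction that minimizes the expected squared error is the conditional mean:

$$\arg\min_{\hat z}\; \mathbb{E}\big[\lVert \hat z - z \rVert_2^2 \,\big|\, c, \tau\big] \;=\; \mathbb{E}\big[z \mid c, \tau\big].$$

If two different stems with embeddings $z_1$ and $z_2$ fit the same context equally often, the optimum is $(z_1 + z_2)/2$, which may lie far from both valid answers.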