Stem-JEPA — World Modeling

At a glance

Problem: Music is multitrack: retrieving or generating stems that fit a mix needs a model of cross-stem compatibility, not just per-clip features.

Key idea: Cast compatibility as joint-embedding prediction: predict the latent of a held-out stem from a mix of the others, conditioned on stem type.

Modality: Music audio (multitrack stems)

Target / masking: A whole stem is the masking unit: one source is removed and its embedding predicted from the rest.

Builds on: I-JEPA / A-JEPA latent prediction, generalized from within-signal patches to between-source relations.

Used for: Stem retrieval, accompaniment matching, musical/acoustic compatibility estimation.

Motivation

A musical mix is intrinsically compositional: it decomposes into stems — drums, bass, vocals, harmony — that must be musically and acoustically compatible to sound right together. Tasks like finding an accompaniment for a vocal line, retrieving a bass that fits a groove, or generating a complementary track all require a model of cross-stem fit, not merely good per-clip embeddings. Standard audio SSL learns within-signal structure but says nothing about whether two simultaneously sounding sources belong together. Stem-JEPA targets exactly this relational notion of compatibility.

How it works

Figure: canonical JEPA schematic, applied to audio spectrograms. The input is split into a visible context and hidden targets (here at stem level: one whole stem is held out). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimizes the distance between predicted and target embeddings in latent space.

Stem-JEPA reframes compatibility as a masked-prediction task where the masking unit is a whole stem.

A context encoder $f_\theta$ embeds a mix of some subset of the stems.
A predictor $g_\phi$, conditioned on the type of the held-out (target) stem, predicts the latent representation of that missing source.
A target encoder $\bar f_\theta$ (EMA) independently embeds the true held-out stem to provide the target.
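
The three components above can be sketched as a single forward pass. This is a minimal illustration, not the paper's architecture: the linear-plus-tanh encoders, the one-hot type conditioning, the stem label set, and all sizes are assumptions chosen to keep the example small, and random noise stands in for real audio.

```python
import numpy as np

rng = np.random.default_rng(0)

STEM_TYPES = ["drums", "bass", "vocals", "other"]  # hypothetical label set
N_SAMPLES, EMB_DIM = 1024, 16                      # toy sizes (assumed)

def make_linear(in_dim, out_dim, rng):
    """Toy parameter set: one weight matrix, scaled by 1/sqrt(in_dim)."""
    return {"W": rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))}

def encode(params, x):
    """Stand-in for an audio encoder: linear map followed by tanh."""
    return np.tanh(x @ params["W"])

def predict(params, z_context, stem_type):
    """Stand-in predictor, conditioned on the target type via a one-hot vector."""
    one_hot = np.eye(len(STEM_TYPES))[STEM_TYPES.index(stem_type)]
    return np.concatenate([z_context, one_hot]) @ params["W"]

# One multitrack excerpt: a waveform per stem (noise as placeholder audio).
stems = {t: rng.normal(size=N_SAMPLES) for t in STEM_TYPES}

# Stem-level masking: hold out one stem, sum the others into the context mix.
target_type = "bass"
x_target = stems[target_type]
x_context = sum(w for t, w in stems.items() if t != target_type)

context_enc = make_linear(N_SAMPLES, EMB_DIM, rng)
target_enc = {"W": context_enc["W"].copy()}  # EMA copy, initialized as a clone
predictor = make_linear(EMB_DIM + len(STEM_TYPES), EMB_DIM, rng)

z_context = encode(context_enc, x_context)           # context encoder on the mix
z_pred = predict(predictor, z_context, target_type)  # type-conditioned prediction
z_target = encode(target_enc, x_target)              # target embedding (no gradient)
```

The predictor never sees the held-out waveform; it receives only the context embedding and the type label, so any information about the target has to be inferred from the remaining stems.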

Because one source is removed and reconstructed only in embedding space from the remaining mix, the model must learn the relational structure among instruments that play together. Distance in the resulting space encodes compatibility, and the type conditioning enables type-aware retrieval — e.g. "find a drum stem for this mix".
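
Type-aware retrieval can then be sketched as a nearest-neighbor search: predict the embedding of the missing stem, then rank candidate stems by their distance to that prediction. The snippet below assumes Euclidean distance (matching the squared-L2 training objective) and uses random placeholder embeddings; the `retrieve` helper and the candidate bank are hypothetical names for illustration only.

```python
import numpy as np

def retrieve(z_pred, candidates, k=3):
    """Return indices and distances of the k candidates closest to z_pred."""
    dists = np.linalg.norm(candidates - z_pred, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)

# Placeholder bank: embeddings of 100 candidate drum stems (assumed precomputed
# with the target encoder).
bank = rng.normal(size=(100, 16))

# Pretend the predictor, asked for "drums", lands close to candidate 42.
z_pred = bank[42] + 0.05 * rng.normal(size=16)

top, dists = retrieve(z_pred, bank)
print(top, dists)
```

Restricting the bank to stems of the requested type is one simple way to realize a query such as "find a drum stem for this mix"; cosine similarity would be a reasonable alternative ranking function.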

The objective

Let $x_s$ be the held-out target stem, of type $\tau$, and let $x_{\setminus s}$ be the mix of the remaining stems. Training minimizes the latent distance between the predicted and true stem embeddings:

$$\mathcal{L} = \big\lVert\, g_\phi\big(f_\theta(x_{\setminus s}),\, \tau\big) - \operatorname{sg}\big[\bar f_\theta(x_s)\big]\,\big\rVert_2^2,$$

with $\operatorname{sg}$ the stop-gradient and $\bar f_\theta$ updated by EMA. The conditioning on the target type $\tau$ lets a single predictor handle any missing instrument, so the learned geometry captures which stems fit a given context rather than only how to embed an isolated clip.
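
A minimal numerical sketch of this objective and of the EMA update follows. NumPy has no automatic differentiation, so the stop-gradient is expressed here by differentiating only with respect to the prediction; the decay value of 0.99 and the function names are assumptions, not values taken from the paper.

```python
import numpy as np

def stem_jepa_loss(z_pred, z_target):
    """Squared L2 distance between predicted and target embeddings."""
    return float(np.sum((z_pred - z_target) ** 2))

def loss_grad_wrt_pred(z_pred, z_target):
    """Gradient w.r.t. the prediction only; the target is a constant (sg)."""
    return 2.0 * (z_pred - z_target)

def ema_update(target_params, online_params, decay=0.99):
    """Return new target params: decay * target + (1 - decay) * online."""
    return {
        name: decay * target_params[name] + (1.0 - decay) * online_params[name]
        for name in target_params
    }

z_pred = np.array([1.0, 2.0, 3.0])
z_target = np.array([1.0, 0.0, 3.0])
loss = stem_jepa_loss(z_pred, z_target)      # (2 - 0)^2 = 4.0
grad = loss_grad_wrt_pred(z_pred, z_target)  # [0, 4, 0]

online = {"W": np.ones(2)}
target = {"W": np.zeros(2)}
target = ema_update(target, online)          # each entry moves 1% toward online
```

Because the target encoder is updated only through this moving average, it changes slowly relative to the context encoder. In JEPA-style training, this slow-moving target together with the stop-gradient is the usual guard against representational collapse.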

Key results & what's novel

Stem-JEPA generalizes JEPA's masked-prediction principle from within-signal patches to between-source relations: the held-out unit is an entire musical source, not a spectrogram block. The learned space acts as a metric for compatibility, in that nearby embeddings correspond to stems that fit together, which directly supports stem retrieval, accompaniment matching, and compatibility estimation. The novelty is showing that the latent-prediction objective adapts cleanly to a compositional, multi-source modality, where the structure to be modeled is the relationship among simultaneously sounding parts rather than the internal texture of one signal.

Strengths & limitations

+ Learns a directly usable compatibility metric for music retrieval and accompaniment.
+ Type-conditioned predictor handles arbitrary missing instruments with one model.
+ Extends JEPA to relational, multi-source structure rather than within-signal patches.
− Requires stem-separated (multitrack) training data, which is scarcer than mixes.
− Compatibility as latent distance is an approximation; musical "fit" is partly subjective and context-dependent.
− Predicting a single expected stem embedding cannot capture the multimodality of equally valid accompaniments: many different stems can fit the same context.

Connections & references

Builds on: I-JEPA, A-JEPA

Paper ↗