Motivation
General-purpose audio encoders are most often trained either by masked spectrogram reconstruction or by contrastive learning. The first carries the overhead of a decoder and the bias of reconstructing low-level spectro-temporal detail; the second requires negative mining and a hand-designed augmentation pipeline whose invariances may not suit all audio domains. Audio-JEPA aims for a self-supervised audio foundation model that sidesteps both overheads by learning entirely through latent prediction — keeping the objective semantic while removing the decoder and the negatives.
How it works
Audio is converted to a spectrogram and tokenized into patches. Three networks then operate on those patches:
- A context encoder $f_\theta$ encodes only the visible patches.
- An EMA target encoder $\bar f_\theta$ encodes the full spectrogram; its outputs at masked positions are the targets, with gradients stopped.
- A predictor $g_\phi$ estimates the latent embeddings of the masked target regions from the context tokens and position-carrying mask tokens.
Patches are masked at a high ratio, so the predictor must infer abstract acoustic content from limited visible context rather than interpolate nearby texture. No pixel/spectrogram reconstruction is performed and no contrastive negatives are used — the recipe closely mirrors I-JEPA in vision, applied to time-frequency tokens.
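The split between visible context and masked targets can be sketched as follows. This is a minimal illustration, not the paper's masking code: it assumes rectangular target blocks on the patch grid in the style of I-JEPA, and the function name `sample_block_mask`, the grid size, the block sizes, and the number of blocks are all hypothetical choices made for the example.

```python
import random

def sample_block_mask(n_freq, n_time, n_blocks=4, block_f=4, block_t=6, seed=0):
    """Split an n_freq x n_time patch grid into context and target patches.

    Each target block is a rectangle of flattened patch indices; the context
    is every patch not covered by any target block. All sizes here are
    illustrative assumptions, not values from the source.
    """
    rng = random.Random(seed)
    target_blocks = []
    masked = set()
    for _ in range(n_blocks):
        # Top-left corner chosen so the block stays inside the grid.
        f0 = rng.randint(0, n_freq - block_f)
        t0 = rng.randint(0, n_time - block_t)
        block = [f * n_time + t
                 for f in range(f0, f0 + block_f)
                 for t in range(t0, t0 + block_t)]
        target_blocks.append(block)
        masked.update(block)
    # The context encoder only ever sees these indices.
    context_ids = [i for i in range(n_freq * n_time) if i not in masked]
    return context_ids, target_blocks

context_ids, target_blocks = sample_block_mask(n_freq=8, n_time=16)
mask_ratio = 1 - len(context_ids) / (8 * 16)
```

Because blocks may overlap, the realized `mask_ratio` varies from sample to sample; a real pipeline would likely resample or resize blocks to hit a chosen ratio.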
The objective
For masked target regions $k=1\dots M$, training minimizes the representation-space $\ell_2$ regression loss:
$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$
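where $z_{\text{ctx}}$ is the context encoder's output on the visible patches, $m_k$ is the position-carrying mask token for region $k$, and $\operatorname{sg}$ denotes stop-gradient. The target encoder's parameters $\bar\theta$ are an exponential moving average of the context encoder's parameters $\theta$, updated as $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. The combination of EMA targets, a narrow predictor, and high-ratio masking provides the anti-collapse pressure; there is no auxiliary reconstruction or contrastive term.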
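The loss and the EMA update can be checked with a small numeric sketch. It uses plain Python lists instead of an autograd framework, so stop-gradient is implicit (the targets are constants). The function names, the one-vector-per-region simplification, and the default `tau=0.996` are assumptions made for illustration, not values given in the source.

```python
def l2_latent_loss(predictions, targets):
    """Mean over the M target regions of the squared L2 distance.

    predictions[k] stands in for g_phi(z_ctx, m_k) and targets[k] for
    sg[f_bar(x)_k]; each is one embedding vector per region here.
    """
    assert len(predictions) == len(targets)
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total / len(predictions)

def ema_update(target_params, online_params, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, element-wise."""
    return [tau * tb + (1.0 - tau) * t
            for tb, t in zip(target_params, online_params)]

preds = [[1.0, 2.0], [0.0, 0.0]]
tgts = [[1.0, 0.0], [3.0, 4.0]]
loss = l2_latent_loss(preds, tgts)  # (4 + 25) / 2 = 14.5

# With tau = 0.9 the target moves one tenth of the way toward the online weights.
new_target = ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9)
```

In a real training loop only the context encoder and predictor receive gradients from `loss`; the target encoder changes solely through `ema_update`.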
Key results & what's novel
Audio-JEPA contributes a clean, decoder-free baseline for general audio representation learning and evidence that joint-embedding predictive pretraining competes with reconstruction- and contrastive-based audio SSL across classification and tagging benchmarks. Its representations transfer across diverse audio tasks without a task-specific augmentation pipeline. The contribution is consolidation more than reinvention: by faithfully transferring the I-JEPA recipe — predict abstract latents of masked time-frequency regions — it helps establish JEPA as a practical recipe for audio foundation models alongside the spectrogram and waveform variants in the family.
Strengths & limitations
- + No decoder and no contrastive negatives or augmentation engineering.
- + Broadly transferable features across audio classification and tagging.
- + A clean, reproducible baseline that consolidates the audio JEPA recipe.
- − Bound to a fixed spectrogram transform, discarding phase and fixing time-frequency resolution.
- − Sensitive to masking ratio and geometry, as the audio-design study makes explicit.
- − A representation learner, not generative; offers no dynamics or action modeling.