Audio-JEPA — World Modeling

At a glance

Problem: General audio encoders are trained by masked spectrogram reconstruction or contrastive learning, paying for decoding effort or negative mining and augmentation design.

Key idea: A faithful transfer of the I-JEPA recipe to audio: predict abstract latents of masked time-frequency regions, no decoder, no negatives.

Modality: General audio (spectrogram patches)

Target / masking: Spectrogram patch/block masked at a high ratio; targets are EMA latents of masked regions.

Builds on: I-JEPA latent masked prediction; A-JEPA's audio adaptation.

Used for: Audio foundation encoders; classification and tagging transfer benchmarks.

Motivation

General-purpose audio encoders are most often trained either by masked spectrogram reconstruction or by contrastive learning. The first carries the overhead of a decoder and the bias of reconstructing low-level spectro-temporal detail; the second requires negative mining and a hand-designed augmentation pipeline whose invariances may not suit all audio domains. Audio-JEPA aims for a self-supervised audio foundation model that sidesteps both overheads by learning entirely through latent prediction — keeping the objective semantic while removing the decoder and the negatives.

How it works

Canonical JEPA schematic for an audio spectrogram. The input is split into a visible context and hidden targets (patch-level or block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embeddings to predicted target embeddings; training minimises the latent distance between prediction and target.

Audio is converted to a spectrogram and tokenized into patches.

A context encoder $f_\theta$ encodes only the visible patches.
An EMA target encoder $\bar f_\theta$ encodes the full spectrogram; its outputs at masked positions are the targets, with gradients stopped.
A predictor $g_\phi$ estimates the latent embeddings of the masked target regions from the context tokens and position-carrying mask tokens.
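The data flow between the three components can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `encode` and `predict` are hypothetical toy stand-ins (a per-patch linear map, and a mean-pool plus positional offset), whereas a real model would typically use encoders that mix information across patches. The weights, patch values, and index split are made up for the example.

```python
def encode(patches, weights):
    """Toy per-patch linear encoder (hypothetical): one latent per input patch."""
    return [[sum(w * x for w, x in zip(row, patch)) for row in weights]
            for patch in patches]

def predict(context_latents, target_positions, pos_embed):
    """Toy predictor (hypothetical): pooled context plus a per-position offset.

    The offset plays the role of the position-carrying mask token.
    """
    dim = len(context_latents[0])
    pooled = [sum(z[d] for z in context_latents) / len(context_latents)
              for d in range(dim)]
    return [[pooled[d] + pos_embed[k][d] for d in range(dim)]
            for k in target_positions]

# Four toy "patches" of two values each; indices 1 and 2 are masked.
patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
context_idx, target_idx = [0, 3], [1, 2]

theta = [[1.0, 0.0], [0.0, 1.0]]      # context-encoder weights (assumed)
theta_bar = [[1.0, 0.0], [0.0, 1.0]]  # EMA target-encoder weights (assumed)
pos_embed = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]

# Context encoder sees only the visible patches.
z_ctx = encode([patches[i] for i in context_idx], theta)

# Target encoder sees the full input; only masked positions become targets.
# No gradient would flow through these values (stop-gradient).
full_latents = encode(patches, theta_bar)
targets = [full_latents[k] for k in target_idx]

# Predictor estimates the masked latents from context plus positions.
preds = predict(z_ctx, target_idx, pos_embed)
```

The point of the sketch is the asymmetry: the context branch never receives the masked patches, while the target branch encodes everything and is only read at the masked positions.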

Patches are masked at a high ratio, so the predictor must infer abstract acoustic content from limited visible context rather than interpolate nearby texture. No pixel/spectrogram reconstruction is performed and no contrastive negatives are used — the recipe closely mirrors I-JEPA in vision, applied to time-frequency tokens.
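As a rough sketch of the tokenization and masking step, the snippet below splits a toy spectrogram into non-overlapping patches and hides most of them. The patch size, the 0.75 ratio, and uniform random patch-level sampling are assumptions chosen for illustration; the source states only that the ratio is high and that masking may be patch-level or block-wise (block masking would sample contiguous rectangles instead). The helper names `patchify` and `sample_mask` are hypothetical.

```python
import random

def patchify(spec, patch_f, patch_t):
    """Split a 2-D spectrogram (list of frequency rows) into flattened patches.

    Patches are returned in row-major order over the (frequency, time) grid.
    """
    n_f, n_t = len(spec), len(spec[0])
    if n_f % patch_f or n_t % patch_t:
        raise ValueError("spectrogram shape must be divisible by the patch size")
    patches = []
    for f0 in range(0, n_f, patch_f):
        for t0 in range(0, n_t, patch_t):
            patches.append([spec[f][t]
                            for f in range(f0, f0 + patch_f)
                            for t in range(t0, t0 + patch_t)])
    return patches

def sample_mask(num_patches, mask_ratio, seed=0):
    """Return (context_idx, target_idx): sorted, disjoint patch index lists."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    target_idx = sorted(rng.sample(range(num_patches), num_masked))
    masked = set(target_idx)
    context_idx = [i for i in range(num_patches) if i not in masked]
    return context_idx, target_idx

# Toy spectrogram: 8 frequency bins x 16 frames, cut into 4 x 4 patches.
spec = [[float(f * 16 + t) for t in range(16)] for f in range(8)]
patches = patchify(spec, patch_f=4, patch_t=4)
context_idx, target_idx = sample_mask(len(patches), mask_ratio=0.75)
```

With 8 patches and a 0.75 ratio, 6 patches become targets and only 2 remain visible, which is what forces the predictor to rely on more than nearby texture.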

The objective

For masked target regions $k=1\dots M$, training minimizes the representation-space $\ell_2$ regression loss:

$$\mathcal{L} = \frac{1}{M}\sum_{k} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$

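where $z_{\text{ctx}}$ is the context-encoder output on the visible patches, $m_k$ is the position-carrying mask token for region $k$, $\operatorname{sg}$ denotes stop-gradient, and the target-encoder parameters $\bar\theta$ are updated by EMA with momentum $\tau$: $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. The combination of EMA targets, a narrow predictor, and high-ratio masking provides the anti-collapse pressure; there is no auxiliary reconstruction or contrastive term.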
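The loss and the EMA update are simple enough to write out directly. The sketch below is a dependency-free illustration on made-up numbers, with hypothetical function names; the value of $\tau$ is an assumption, since the source does not give one. Plain Python has no autograd, so the stop-gradient is implicit: the targets are simply treated as constants.

```python
def latent_l2_loss(predictions, targets):
    """Mean over masked regions of the squared L2 distance between latents."""
    if len(predictions) != len(targets):
        raise ValueError("need one prediction per target region")
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total / len(predictions)

def ema_update(target_params, online_params, tau):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, element-wise."""
    return [tau * tb + (1.0 - tau) * th
            for tb, th in zip(target_params, online_params)]

# Two masked regions (M = 2) with 2-dimensional latents.
preds = [[1.0, 2.0], [0.0, 0.0]]
tgts = [[1.0, 0.0], [3.0, 4.0]]
loss = latent_l2_loss(preds, tgts)  # (4 + 25) / 2 = 14.5

# One EMA step with tau = 0.75 (illustrative value only).
new_target = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.75)  # [0.75, 0.25]
```

A $\tau$ close to 1 makes the target encoder change slowly, so the regression targets stay relatively stable while the context encoder and predictor are trained.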

Key results & what's novel

Audio-JEPA contributes a clean, decoder-free baseline for general audio representation learning and evidence that joint-embedding predictive pretraining competes with reconstruction- and contrastive-based audio SSL across classification and tagging benchmarks. Its representations transfer across diverse audio tasks without a task-specific augmentation pipeline. The contribution is consolidation more than reinvention: by faithfully transferring the I-JEPA recipe — predict abstract latents of masked time-frequency regions — it helps establish JEPA as a practical recipe for audio foundation models alongside the spectrogram and waveform variants in the family.

Strengths & limitations

+ No decoder and no contrastive negatives or augmentation engineering.
+ Broadly transferable features across audio classification and tagging.
+ A clean, reproducible baseline that consolidates the audio JEPA recipe.
− Bound to a fixed spectrogram transform, discarding phase and fixing time-frequency resolution.
− Sensitive to masking ratio and geometry, as the audio-design study makes explicit.
− A representation learner, not generative; offers no dynamics or action modeling.

Connections & references

Builds on: I-JEPA, A-JEPA
