Design Choices in JEPA for General Audio

At a glance

Problem: Porting JEPA to audio leaves many design choices under-specified, and vision-optimal settings need not hold for time-frequency signals.

Key idea: A controlled empirical study that varies masking, encoder, predictor, and target choices to find what drives audio JEPA representation quality.

Modality: General audio (log-mel spectrograms)

Target / masking: Spectrogram patches or blocks; systematically varied (random vs. block, time-only/frequency-only/time-frequency, ratio, target count).

Builds on: The I-JEPA recipe and A-JEPA's spectrogram adaptation.

Used for: A practical recipe and ablation map for general-audio JEPA encoders.

Motivation

When JEPA is moved from images to audio, a large design space opens up and little of it is settled. The masking geometry, mask ratio, encoder and predictor capacity, and target normalization were all tuned for natural images, but spectrograms have a very different statistical structure: the time and frequency axes carry distinct, anisotropic correlations, and audio is non-stationary in ways images are not. Rather than copying vision defaults and hoping, this work runs a controlled empirical study to identify which design choices actually determine representation quality for general audio, producing an evidence-based recipe instead of guesswork.

How it works

Canonical JEPA schematic for audio spectrograms. The input is split into a visible context and hidden targets (patches or blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between predicted and actual target embeddings in latent space.

The study fixes the standard JEPA template and varies its components one axis at a time. The template is:

a context encoder $f_\theta$ embedding visible spectrogram patches;
an EMA target encoder $\bar f_\theta$ producing the masked-region targets;
a predictor $g_\phi$ trained to match those latents under an $\ell_2$ loss.
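The EMA relationship between the two encoders in this template can be sketched as follows. This is a minimal illustration in plain Python, not the study's code: the function name `ema_update`, the flat list-of-floats weight representation, and the momentum value are all assumptions made for the example.

```python
def ema_update(target_params, context_params, momentum=0.99):
    """Move each target-encoder weight a small step toward the context-encoder weight.

    Hypothetical helper: weights are plain floats here, and the momentum
    value is an assumed placeholder, not a setting reported by the study.
    """
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_params, context_params)
    ]


# Toy usage: the target weights drift toward the context weights.
target = [0.0, 1.0]
context = [1.0, 1.0]
target = ema_update(target, context, momentum=0.9)
```

Because the target encoder only ever changes through this averaging step, no gradient flows into it, which is what "gradient stopped" refers to in the schematic.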

Around this skeleton the authors sweep the masking unit and strategy (random vs. block; time-only, frequency-only, or joint time-frequency masking; mask ratio; number of target blocks), encoder architecture and capacity, predictor depth, and target normalization. Each configuration is evaluated by transfer to general-audio benchmarks spanning environmental sound, music, and speech.
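To make the masking axes concrete, the sketch below generates the four mask geometries named above on a patch grid. It is an illustration under assumptions, not the authors' sampler: the function `make_mask`, its mode names, and the single-block simplification (the study also varies the number of target blocks) are all hypothetical.

```python
import random


def make_mask(n_freq, n_time, mode, ratio, seed=0):
    """Return an n_freq x n_time grid of booleans; True marks a hidden (target) patch.

    Hypothetical helper covering four geometries: "random", "time",
    "freq", and "time_freq" (one rectangular block).
    """
    rng = random.Random(seed)
    mask = [[False] * n_time for _ in range(n_freq)]
    if mode == "random":
        # Hide individual patches uniformly at random.
        cells = [(f, t) for f in range(n_freq) for t in range(n_time)]
        chosen = rng.sample(cells, round(ratio * len(cells)))
    elif mode == "time":
        # Hide a contiguous span of time steps across every frequency band.
        width = round(ratio * n_time)
        start = rng.randint(0, n_time - width)
        chosen = [(f, t) for f in range(n_freq) for t in range(start, start + width)]
    elif mode == "freq":
        # Hide a contiguous band of frequencies across every time step.
        height = round(ratio * n_freq)
        start = rng.randint(0, n_freq - height)
        chosen = [(f, t) for f in range(start, start + height) for t in range(n_time)]
    elif mode == "time_freq":
        # Hide one rectangle whose area is roughly `ratio` of the grid.
        height = max(1, round(ratio ** 0.5 * n_freq))
        width = max(1, round(ratio ** 0.5 * n_time))
        f0 = rng.randint(0, n_freq - height)
        t0 = rng.randint(0, n_time - width)
        chosen = [(f, t) for f in range(f0, f0 + height) for t in range(t0, t0 + width)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    for f, t in chosen:
        mask[f][t] = True
    return mask


# Toy usage: an 8 x 16 patch grid with half of the time axis hidden.
time_mask = make_mask(8, 16, "time", 0.5)
hidden = sum(sum(row) for row in time_mask)
```

The "time" and "freq" modes differ only in which axis the hidden span runs along, which is the distinction the sweep probes when it compares time-only, frequency-only, and joint masking.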

The objective

Across all configurations the underlying objective is the JEPA latent regression,

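$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$

where $M$ is the number of target positions, $z_{\text{ctx}}$ is the context encoder's embedding of the visible patches, $m_k$ is the query telling the predictor which target position $k$ to predict (a positional mask token in I-JEPA), $\bar f_\theta(x)_k$ is the target encoder's embedding at that position, and $\operatorname{sg}$ is the stop-gradient. The target encoder is updated by EMA. The study does not propose a new loss; instead it treats the masking distribution, the architectures of $f_\theta$, $\bar f_\theta$, and $g_\phi$, and the target normalization as the experimental variables, holding the objective constant so that measured differences in downstream performance can be attributed to those design choices.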
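As a worked instance of this objective, the sketch below averages the squared $\ell_2$ distance between predicted and target embeddings over the target positions. It is a plain-Python illustration with made-up numbers; the function name `jepa_loss` and the list-of-lists embedding layout are assumptions, and the targets are treated as constants to stand in for the stop-gradient.

```python
def jepa_loss(predictions, targets):
    """Mean over target positions of the squared L2 distance between embeddings.

    Hypothetical helper: `predictions[k]` plays the role of the predictor
    output for position k and `targets[k]` the target-encoder embedding.
    Targets are plain constants here, mirroring the stop-gradient.
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("need one prediction per target position")
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total / len(predictions)


# Toy usage with two target positions and 2-dimensional embeddings:
# position 0 contributes 0 + 4 = 4, position 1 contributes 9 + 16 = 25.
loss = jepa_loss([[1.0, 2.0], [0.0, 0.0]], [[1.0, 0.0], [3.0, 4.0]])
# loss == (4 + 25) / 2 == 14.5
```

Since every configuration in the study shares this loss, only the inputs to it change: which positions count as targets, and which networks produce the two sets of embeddings.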

Key results & what's novel

The dominant finding is that masking geometry is the primary lever for audio JEPA. Masks aligned with the anisotropic correlation structure of spectrograms, that is, masks that account for the very different statistics along the time and frequency axes, matter more than the analogous choices do in vision, and the target/predictor configuration interacts strongly with the chosen masking regime. The contribution is not a single new model but a systematic ablation map: a documented, evidence-based recipe for configuring general-audio JEPA encoders that directly informs later raw-waveform and music-specific variants.

Strengths & limitations

+ Turns ad-hoc design into measured, reproducible guidance for audio JEPA.
+ Isolates masking geometry as the key driver, with practical defaults.
+ Covers diverse audio domains (environmental, music, speech).
− An empirical study, not a new method; conclusions are bounded by the architectures and benchmarks swept.
− Restricted to spectrogram inputs, so phase and fixed-resolution limitations of that representation persist.
− Findings may need re-validation as backbones and datasets scale.

Connections & references

Builds on: I-JEPA, A-JEPA
