Motivation
When JEPA is moved from images to audio, a large design space opens up and little of it is settled. The masking geometry, mask ratio, encoder and predictor capacity, and target normalization were all tuned for natural images, but spectrograms have a very different statistical structure: the time and frequency axes carry distinct, anisotropic correlations, and audio is non-stationary in ways images are not. Rather than copying vision defaults and hoping, this work runs a controlled empirical study to identify which design choices actually determine representation quality for general audio, producing an evidence-based recipe instead of guesswork.
How it works
The study fixes the standard JEPA template and varies its components one axis at a time. The template is:
- a context encoder $f_\theta$ embedding visible spectrogram patches;
- an EMA target encoder $\bar f_\theta$ producing the masked-region targets;
- a predictor $g_\phi$ trained to match those latents under an $\ell_2$ loss.
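The EMA target encoder in the template above can be illustrated with a minimal sketch. It assumes parameters are stored as a dictionary of NumPy arrays; the function name `ema_update` and the momentum value are illustrative and are not taken from the study.

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.99):
    """Return updated target parameters:
    momentum * target + (1 - momentum) * online, applied per tensor."""
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * online_params[name]
        for name in target_params
    }

# Toy parameters standing in for the context (online) and target encoders.
online = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

# One EMA step with a hypothetical momentum of 0.9.
target = ema_update(target, online, momentum=0.9)
```

The target encoder receives no gradient; it only tracks the context encoder through this update, which is why it appears under a stop-gradient in the loss.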
Around this skeleton the authors sweep the masking unit and strategy (random vs. block; time-only, frequency-only, or joint time-frequency masking; mask ratio; number of target blocks), encoder architecture and capacity, predictor depth, and target normalization. Each configuration is evaluated by transfer to general-audio benchmarks spanning environmental sound, music, and speech.
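To make the masking axes concrete, the following is a minimal sketch of how random, time-only, frequency-only, and joint block masks could be drawn over a grid of spectrogram patches. The grid size, block sizes, ratios, and the function names `random_mask` and `block_mask` are assumptions chosen for illustration; the study's actual settings are not reproduced here.

```python
import numpy as np

def random_mask(n_freq, n_time, ratio, rng):
    """Mask a `ratio` fraction of patches uniformly at random."""
    n = n_freq * n_time
    n_masked = int(round(ratio * n))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=n_masked, replace=False)] = True
    return flat.reshape(n_freq, n_time)

def block_mask(n_freq, n_time, block_f, block_t, n_blocks, rng):
    """Union of `n_blocks` rectangles of size block_f x block_t.
    Blocks may overlap, so the realized mask ratio varies."""
    mask = np.zeros((n_freq, n_time), dtype=bool)
    for _ in range(n_blocks):
        f0 = rng.integers(0, n_freq - block_f + 1)  # upper bound is exclusive
        t0 = rng.integers(0, n_time - block_t + 1)
        mask[f0:f0 + block_f, t0:t0 + block_t] = True
    return mask

rng = np.random.default_rng(0)
F, T = 8, 32  # hypothetical patch grid: 8 frequency x 32 time patches

# Time-only: blocks span every frequency patch, so whole time spans are hidden.
time_only = block_mask(F, T, block_f=F, block_t=6, n_blocks=2, rng=rng)
# Frequency-only: blocks span every time patch, so whole bands are hidden.
freq_only = block_mask(F, T, block_f=2, block_t=T, n_blocks=2, rng=rng)
# Joint time-frequency: rectangles localized on both axes.
joint = block_mask(F, T, block_f=3, block_t=6, n_blocks=4, rng=rng)
# Unstructured baseline at a fixed mask ratio.
rand = random_mask(F, T, ratio=0.5, rng=rng)
```

In this parameterization, the masking strategy, block shape, mask ratio, and number of target blocks each map to one argument, so any one of them can be varied while the others stay fixed.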
The objective
Across all configurations the underlying objective is the JEPA latent regression,
$$\mathcal{L} = \frac{1}{M}\sum_{k} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$
where the sum runs over the $M$ masked target positions $k$, $z_{\text{ctx}}$ is the context-encoder output on the visible patches, $m_k$ is the query (mask token) for target position $k$, and $\operatorname{sg}$ is the stop-gradient; the target encoder is updated by EMA. The study does not propose a new loss. Instead it treats the masking distribution, the architectures of $f_\theta$, $\bar f_\theta$, and $g_\phi$, and the target normalization as the experimental variables, holding the objective constant so that measured differences in downstream performance can be attributed to those design choices.
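The objective can be sketched numerically as follows. This is a toy illustration under loud assumptions: the encoders and predictor are single linear maps, the context is mean-pooled before prediction (a real predictor would typically attend over context tokens), and `standardize` is just one possible form of target normalization, since the source does not say which normalizations were compared. All names and sizes are hypothetical.

```python
import numpy as np

def standardize(z, eps=1e-6):
    """Per-token standardization over the feature axis
    (one possible target normalization, assumed for illustration)."""
    return (z - z.mean(axis=-1, keepdims=True)) / (z.std(axis=-1, keepdims=True) + eps)

def jepa_loss(pred, target, normalize_target=False):
    """(1/M) * sum_k ||pred_k - target_k||_2^2 for arrays of shape (M, D)."""
    if normalize_target:
        target = standardize(target)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

rng = np.random.default_rng(0)
N, P, D = 16, 12, 8               # hypothetical: 16 patches, patch dim 12, latent dim 8
x = rng.normal(size=(N, P))       # flattened spectrogram patches
masked = np.zeros(N, dtype=bool)
masked[4:10] = True               # M = 6 target positions

W_ctx = rng.normal(size=(P, D)) * 0.1    # toy linear context encoder f_theta
W_tgt = W_ctx.copy()                     # toy target encoder (stands in for the EMA copy)
pos = rng.normal(size=(N, D)) * 0.1      # per-position queries m_k
W_pred = rng.normal(size=(D, D)) * 0.1   # toy linear predictor g_phi

z_ctx = (x[~masked] @ W_ctx).mean(axis=0)   # pooled latent of the visible patches
pred = (z_ctx + pos[masked]) @ W_pred       # g_phi(z_ctx, m_k), shape (M, D)
target = (x @ W_tgt)[masked]                # target latents, treated as constants
loss = jepa_loss(pred, target)
loss_normed = jepa_loss(pred, target, normalize_target=True)
```

NumPy has no automatic differentiation, so the stop-gradient is implicit here. In an autodiff framework the target would be wrapped explicitly, for example with `jax.lax.stop_gradient` in JAX or `.detach()` in PyTorch.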
Key results & what's novel
The dominant finding is that masking geometry is the primary lever for audio JEPA. Masks aligned with the anisotropic correlation structure of spectrograms, that is, masks that account for the very different statistics along the time and frequency axes, matter more than the analogous choices do in vision. The target/predictor configuration also interacts strongly with the chosen masking regime, so the two cannot be tuned independently. The contribution is not a single new model but a systematic ablation map: a documented, evidence-based recipe for configuring general-audio JEPA encoders that directly informs later raw-waveform and music-specific variants.
Strengths & limitations
- + Turns ad-hoc design into measured, reproducible guidance for audio JEPA.
- + Isolates masking geometry as the key driver, with practical defaults.
- + Covers diverse audio domains (environmental, music, speech).
- − An empirical study, not a new method; conclusions are bounded by the architectures and benchmarks swept.
- − Restricted to spectrogram inputs, so the phase and fixed-resolution limitations of that representation persist.
- − Findings may need re-validation as backbones and datasets scale.