A-JEPA — World Modeling

At a glance

Problem: Reconstruction audio models (AudioMAE) waste capacity on pixel-level spectrogram detail; contrastive audio models need augmentations and negatives.

Key idea: Transfer the I-JEPA recipe to sound: predict the representations of masked spectrogram blocks in latent space, with a curriculum over masks.

Modality: Audio (log-mel spectrogram as a 2D image)

Target / masking: Contiguous time-frequency blocks; curriculum masking from random toward time-frequency-aware schemes.

Builds on: I-JEPA's latent masked prediction; ViT audio backbones (AudioMAE lineage).

Used for: General audio and speech classification, audio tagging, transferable acoustic encoders.

Motivation

Self-supervised audio learning had two dominant recipes, each with a cost. Masked reconstruction (AudioMAE) rebuilds the masked spectrogram pixel by pixel, spending capacity on high-frequency spectro-temporal texture that carries little semantic content. Contrastive audio methods avoid reconstruction but depend on carefully engineered augmentations and large negative sets, importing priors about acoustic invariance that may not hold across speech, music, and environmental sound. A-JEPA asks a simple question: does the image-domain I-JEPA idea of predicting abstract latents of masked regions transfer to audio, removing both the decoder and the negatives while still capturing acoustic semantics?

How it works

Figure: canonical JEPA schematic for an audio spectrogram. The input is split into a visible context and hidden targets (blocks of patches). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the latent distance between predictions and targets.

The log-mel spectrogram is treated as a 2D image and split into time-frequency patch tokens.

A context encoder $f_\theta$ (a ViT) embeds only a visible subset of patches.
An EMA target encoder $\bar f_\theta$ embeds the full spectrogram; its outputs at masked positions are the prediction targets, with gradients stopped.
A predictor $g_\phi$ takes context tokens plus mask tokens carrying the positions of masked target blocks and regresses their latent representations.
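The data flow in the steps above can be illustrated with a small sketch of the patching and context/target split. It is a minimal example under stated assumptions, not the authors' implementation: the function names (`patchify`, `sample_block`, `split_context_targets`), the patch size, the block size, and the number of blocks are all hypothetical, and the encoders and predictor are left out entirely.

```python
import random

def patchify(spec, patch_f, patch_t):
    """Split a 2D spectrogram (rows = frequency bins, columns = time frames)
    into a grid of patches keyed by (freq_idx, time_idx)."""
    n_f, n_t = len(spec), len(spec[0])
    assert n_f % patch_f == 0 and n_t % patch_t == 0, "sizes must divide evenly"
    patches = {}
    for i in range(n_f // patch_f):
        for j in range(n_t // patch_t):
            # Flatten one patch in row-major order.
            patches[(i, j)] = [
                spec[i * patch_f + di][j * patch_t + dj]
                for di in range(patch_f)
                for dj in range(patch_t)
            ]
    return patches

def sample_block(grid_f, grid_t, block_f, block_t, rng):
    """Pick one contiguous time-frequency block of patch positions."""
    f0 = rng.randrange(grid_f - block_f + 1)
    t0 = rng.randrange(grid_t - block_t + 1)
    return {(f0 + a, t0 + b) for a in range(block_f) for b in range(block_t)}

def split_context_targets(grid_f, grid_t, n_blocks, block_f, block_t, rng):
    """Return (visible context positions, list of masked target blocks)."""
    blocks = [sample_block(grid_f, grid_t, block_f, block_t, rng)
              for _ in range(n_blocks)]
    masked = set().union(*blocks)
    all_positions = {(i, j) for i in range(grid_f) for j in range(grid_t)}
    return all_positions - masked, blocks

# Toy input: 16 mel bins x 32 frames, 4x4 patches -> a 4x8 patch grid.
rng = random.Random(0)
spec = [[0.0] * 32 for _ in range(16)]
patches = patchify(spec, patch_f=4, patch_t=4)
context, target_blocks = split_context_targets(
    4, 8, n_blocks=2, block_f=2, block_t=3, rng=rng)
# Only these tokens would be fed to the context encoder.
context_tokens = [patches[pos] for pos in sorted(context)]
```

In this sketch the context encoder would see only `context_tokens`, while the target encoder would embed every patch and its outputs at the positions in `target_blocks` would serve as regression targets.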

A central addition is curriculum masking: the mask schedule moves from simple random masking toward time-frequency-aware masks that respect the strongly anisotropic correlation structure of audio, where the time and frequency axes behave very differently.
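One way to picture such a schedule is the sketch below. The text above does not specify the schedule, so everything here is an assumption made for illustration: the linear ramp, the 50/50 choice between axes, the use of full time strips or frequency bands as the "time-frequency-aware" mask, and the function names are all hypothetical.

```python
import random

def random_mask(grid_f, grid_t, ratio, rng):
    """Mask a uniformly random subset of patch positions."""
    positions = [(i, j) for i in range(grid_f) for j in range(grid_t)]
    return set(rng.sample(positions, round(ratio * len(positions))))

def time_frequency_mask(grid_f, grid_t, ratio, rng):
    """Mask a contiguous run of full time columns or full frequency rows
    (an assumed stand-in for a time-frequency-aware mask)."""
    if rng.random() < 0.5:
        width = max(1, round(ratio * grid_t))
        start = rng.randrange(grid_t - width + 1)
        return {(i, j) for i in range(grid_f)
                for j in range(start, start + width)}
    height = max(1, round(ratio * grid_f))
    start = rng.randrange(grid_f - height + 1)
    return {(i, j) for i in range(start, start + height)
            for j in range(grid_t)}

def curriculum_mask(step, total_steps, grid_f, grid_t, ratio, rng):
    """Choose the structured mask with probability equal to training progress
    (assumed linear ramp from 0 at the start to 1 at the end)."""
    progress = min(1.0, max(0.0, step / total_steps))
    if rng.random() < progress:
        return time_frequency_mask(grid_f, grid_t, ratio, rng)
    return random_mask(grid_f, grid_t, ratio, rng)

rng = random.Random(0)
early_mask = curriculum_mask(0, 1000, 4, 8, ratio=0.5, rng=rng)     # always random
late_mask = curriculum_mask(1000, 1000, 4, 8, ratio=0.5, rng=rng)   # always structured
```

Midway through training, this sketch mixes the two mask types in proportion to progress, so the encoder sees easy random masks early and harder structured masks more often later.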

The objective

For masked target blocks $k=1\dots M$, A-JEPA minimizes the latent $\ell_2$ distance between predicted and target embeddings:

$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2$$

where $\operatorname{sg}$ is stop-gradient, $z_{\text{ctx}}$ is the context encoder's output on the visible patches, and $m_k$ is the mask token for block $k$. The target encoder's parameters $\bar\theta$ are updated by EMA with momentum $\tau$, $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. There is no spectrogram decoder and there are no contrastive negatives; the curriculum modulates which blocks are masked over the course of training, but the objective itself remains a pure representation-space regression.
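The loss and the EMA update above can be checked numerically with a short sketch. It assumes embeddings and parameters are plain lists of floats, with hypothetical function names; since there is no autodiff here, the stop-gradient simply corresponds to treating the targets as constants.

```python
def latent_l2_loss(predicted, targets):
    """Mean over the M target blocks of the squared L2 distance between
    predicted and target embeddings (targets are treated as constants)."""
    assert predicted and len(predicted) == len(targets)
    total = 0.0
    for p, t in zip(predicted, targets):
        total += sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(predicted)

def ema_update(target_params, online_params, tau):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, element-wise."""
    return [tau * tb + (1.0 - tau) * th
            for tb, th in zip(target_params, online_params)]

# Two blocks (M = 2) with 2-dimensional embeddings.
predicted = [[1.0, 2.0], [0.0, 0.0]]
targets = [[1.0, 0.0], [3.0, 4.0]]
loss = latent_l2_loss(predicted, targets)  # (4 + 25) / 2 = 14.5

# With tau close to 1 the target encoder drifts slowly toward the online one.
new_target = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.9)
```

A value of $\tau$ near 1 keeps the targets stable between steps; the specific value 0.9 is used here only to make the arithmetic visible.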

Key results & what's novel

A-JEPA is the first demonstration that joint-embedding predictive pretraining is a viable, decoder-free recipe for audio. Treating the spectrogram as an image and predicting masked latents yields encoders that transfer strongly across audio and speech classification benchmarks, competitive with reconstruction-based AudioMAE while avoiding pixel-level decoding. The conceptual novelty is twofold: porting latent masked prediction to the acoustic domain, and introducing curriculum masking that respects the time-frequency structure and non-stationarity distinguishing audio from natural images. A-JEPA seeded a family of audio JEPA variants that further refine masking geometry and input representation.

Strengths & limitations

+ No spectrogram decoder and no contrastive negatives; clean transfer of the I-JEPA recipe.
+ Curriculum masking tailored to time-frequency structure improves over naive random masking.
+ Strong, transferable features across speech and general-audio tasks.
− Still tied to a fixed spectrogram transform, discarding phase and fixing time-frequency resolution.
− Masking design is a sensitive lever that needs tuning per audio domain.
− A representation learner, not generative; the predictor regresses an expected latent and can wash out fine acoustic detail.

Connections & references

Builds on: I-JEPA
