WavJEPA — World Modeling

At a glance

Problem: Spectrogram audio SSL discards phase and fixes a time-frequency resolution; reconstruction/contrastive objectives can latch onto low-level texture.

Key idea: Predict latents directly from the raw waveform: a learnable front end produces frame tokens and the model regresses masked frame-span representations.

Modality: Raw audio waveform (1D)

Target / masking: Contiguous spans of waveform frames; targets are EMA latents of masked spans.

Builds on: I-JEPA / audio-JEPA latent prediction; learnable convolutional front ends from waveform SSL.

Used for: Robust audio foundation models across speech and general-audio tasks.

Motivation

Most audio JEPAs operate on spectrograms, which fix a time-frequency resolution through a hand-chosen transform and discard phase information. Reconstruction- and contrastive-based objectives can additionally latch onto low-level spectral texture rather than semantic content. WavJEPA asks whether predicting latents directly from the raw waveform yields more robust representations: the model learns its own front-end abstraction instead of inheriting the biases of a fixed spectral transform, while keeping the semantic, decoder-free latent-prediction objective.

How it works

Figure: canonical JEPA schematic for raw-waveform input. The input is split into a visible context and hidden targets (contiguous spans of frames). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ predicts the target embeddings from the context embeddings; training minimises the distance between predicted and target latents.

A learnable convolutional feature extractor converts the 1D waveform into a sequence of frame tokens; the natural masking unit is therefore a contiguous span of waveform frames.
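To make the waveform-to-frame conversion concrete, here is a minimal sketch of how many frame tokens a stack of unpadded strided 1D convolutions would produce. The layer configuration is an assumption borrowed from the wav2vec 2.0 feature encoder, not WavJEPA's documented architecture, and the names `ASSUMED_CONV_LAYERS`, `num_frames`, and `hop_samples` are hypothetical.

```python
# Hypothetical front-end configuration: (kernel_size, stride) per conv layer.
# These values follow the wav2vec 2.0 feature encoder and are an assumption,
# not WavJEPA's documented architecture.
ASSUMED_CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples, layers=ASSUMED_CONV_LAYERS):
    """Number of frame tokens produced by unpadded strided 1D convolutions."""
    length = num_samples
    for kernel, stride in layers:
        if length < kernel:
            return 0  # input too short for this layer
        length = (length - kernel) // stride + 1
    return length


def hop_samples(layers=ASSUMED_CONV_LAYERS):
    """Total stride of the stack: waveform samples between adjacent frames."""
    hop = 1
    for _, stride in layers:
        hop *= stride
    return hop


if __name__ == "__main__":
    sample_rate = 16_000  # assumed sample rate in Hz
    print(num_frames(sample_rate))             # frames in one second of audio
    print(1000 * hop_samples() / sample_rate)  # hop between frames, in ms
```

Under these assumed settings, one second of 16 kHz audio becomes 49 frames spaced 20 ms apart, so a masked span of a given number of frames corresponds directly to a stretch of time in the waveform.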

A context encoder $f_\theta$ embeds the unmasked frames.
An EMA target encoder $\bar f_\theta$ produces the representations of the masked frame spans as targets, with gradients stopped.
A predictor $g_\phi$ regresses those target latents from the visible context and position-carrying mask tokens.

No waveform reconstruction, no negatives, and no augmentation are used. Masking contiguous time spans mirrors the temporal structure of speech and general audio, forcing the model to infer missing acoustic content over realistic time scales.
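The span masking described above can be sketched as follows. This is one plausible sampling scheme under stated assumptions, not the authors' procedure: the span length, mask ratio, and the choice to let spans overlap are illustrative, and `sample_span_mask` and `mask_to_spans` are hypothetical helper names.

```python
import random


def sample_span_mask(num_frames, span_len=10, mask_ratio=0.5, seed=None):
    """Return a list of booleans, True where a frame is masked (a target).

    Spans of `span_len` frames are placed at random starts until at least
    `mask_ratio` of the frames are covered. Spans may overlap, and the final
    span can push coverage slightly above the requested ratio.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    if not 1 <= span_len <= num_frames:
        raise ValueError("span_len must be between 1 and num_frames")
    rng = random.Random(seed)
    mask = [False] * num_frames
    target = int(mask_ratio * num_frames)
    masked = 0
    while masked < target:
        start = rng.randrange(0, num_frames - span_len + 1)
        for i in range(start, start + span_len):
            if not mask[i]:
                mask[i] = True
                masked += 1
    return mask


def mask_to_spans(mask):
    """Convert a boolean mask into (start, end) runs of masked frames, end exclusive."""
    spans, start = [], None
    for i, is_masked in enumerate(mask):
        if is_masked and start is None:
            start = i
        elif not is_masked and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(mask)))
    return spans


if __name__ == "__main__":
    mask = sample_span_mask(49, span_len=10, mask_ratio=0.5, seed=0)
    print(mask_to_spans(mask))  # masked spans; the remaining frames are context
```

The frames where the mask is False form the visible context fed to the context encoder; each run returned by `mask_to_spans` is a masked span whose latents the predictor must infer.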

The objective

For masked frame spans $k=1\dots M$, the loss is the latent $\ell_2$ distance:

$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$

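where $z_{\text{ctx}}$ is the context-encoder output on the visible frames, $m_k$ is the position-carrying mask token for span $k$, $\bar f_\theta(x)_k$ is the target-encoder latent for span $k$, and $\operatorname{sg}$ is stop-gradient; the target encoder is updated by EMA. Crucially, both encoders include the learnable waveform front end, so the time-frequency abstraction is itself learned end-to-end rather than imposed by a spectrogram transform; the objective regresses the EMA latents of masked spans entirely in representation space.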
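As a numerical illustration of the objective and the EMA update, the sketch below uses plain Python lists in place of real encoder outputs and parameters. It is a simplified sketch rather than a training implementation: `latent_l2_loss` and `ema_update` are hypothetical names, the default `decay` is an assumed value, and the stop-gradient is represented only by treating the targets as constants.

```python
def latent_l2_loss(predictions, targets):
    """Mean over masked spans of the squared L2 distance between latent vectors.

    `predictions` stand in for predictor outputs g_phi(z_ctx, m_k); `targets`
    stand in for target-encoder latents, treated as constants (stop-gradient).
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("need an equal, non-zero number of predictions and targets")
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total / len(predictions)


def ema_update(target_params, online_params, decay=0.999):
    """Return new target parameters: decay * target + (1 - decay) * online.

    The default decay is an assumed value, not one taken from the paper.
    """
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    preds = [[1.0, 2.0], [0.0, 0.0]]    # two masked spans, 2-D latents
    targets = [[1.0, 0.0], [3.0, 4.0]]
    print(latent_l2_loss(preds, targets))             # (4 + 25) / 2
    print(ema_update([1.0, 0.0], [0.0, 1.0], decay=0.5))
```

In a real training loop only the context encoder and predictor would receive gradients from this loss, while the target encoder would follow the context encoder through the EMA step after each update.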

Key results & what's novel

WavJEPA extends the JEPA family to the waveform domain, removing dependence on fixed spectral transforms by learning the front end inside the same latent-prediction objective. The authors argue that operating end-to-end on raw audio, combined with masking contiguous time spans, improves robustness to noise and channel variation relative to spectrogram pipelines, while latent masked prediction keeps the learning signal semantic. Evaluated across speech and general-audio tasks, it positions latent masked prediction on raw waveforms as a recipe for robust audio foundation models, complementing the spectrogram-based variants in the family.

Strengths & limitations

+ Learns its own time-frequency abstraction; no fixed spectral transform, no discarded phase.
+ Argued robustness to noise and channel variation from end-to-end waveform processing.
+ Decoder-free and augmentation-free, keeping the objective semantic.
− Raw-waveform front ends are compute- and memory-heavy at high sample rates.
− Span-masking design (length, ratio) is a sensitive lever, as in other audio JEPAs.
− A representation learner, not generative; no dynamics or action conditioning.

Connections & references

Builds on: I-JEPA, Audio-JEPA
