Motivation
Most audio JEPAs operate on spectrograms, which fix the time-frequency resolution through a hand-chosen transform and, in their usual magnitude or mel form, discard phase information. Reconstruction-based and contrastive objectives can additionally latch onto low-level spectral texture rather than semantic content. WavJEPA asks whether predicting latents directly from the raw waveform yields more robust representations: the model learns its own front-end abstraction instead of inheriting the biases of a fixed spectral transform, while keeping the semantic, decoder-free latent-prediction objective.
How it works
A learnable convolutional feature extractor converts the 1D waveform into a sequence of frame tokens; the natural masking unit is therefore a contiguous span of waveform frames.
- A context encoder $f_\theta$ embeds the unmasked frames.
- An EMA target encoder $\bar f_\theta$ produces the representations of the masked frame spans as targets, with gradients stopped.
- A predictor $g_\phi$ regresses those target latents from the visible context and position-carrying mask tokens.
The method uses no waveform reconstruction, no negative samples, and no augmentation. Masking contiguous time spans mirrors the temporal structure of speech and general audio, forcing the model to infer missing acoustic content over realistic time scales.
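The framing and span-masking steps can be sketched in a few lines. This is a minimal illustration, not WavJEPA's actual configuration: the convolutional stack below is an assumed wav2vec 2.0-style layout, the 16 kHz sample rate is assumed, and `num_frames`, `sample_span_mask`, `span_len`, and `mask_ratio` are hypothetical names and values chosen for the example.

```python
import random

# Assumed (kernel, stride) per conv layer, wav2vec 2.0-style; NOT taken from WavJEPA.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples, layers=CONV_LAYERS):
    """Number of frame tokens produced by unpadded strided 1D convolutions."""
    length = num_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


def sample_span_mask(n_frames, span_len=10, mask_ratio=0.5, rng=None):
    """Return a list of booleans; True marks a masked (target) frame.

    Contiguous spans of `span_len` frames are placed at random starts until
    at least round(mask_ratio * n_frames) frames are masked. Spans may
    overlap and merge, so the final ratio can overshoot slightly.
    """
    rng = rng or random.Random()
    span_len = min(span_len, n_frames)
    goal = min(n_frames, max(1, round(mask_ratio * n_frames)))
    mask = [False] * n_frames
    while sum(mask) < goal:
        start = rng.randrange(0, n_frames - span_len + 1)
        for i in range(start, start + span_len):
            mask[i] = True
    return mask


# Usage: 1 s of audio at an assumed 16 kHz gives 49 frames under the assumed stack.
frames = num_frames(16000)
mask = sample_span_mask(frames, span_len=10, mask_ratio=0.5, rng=random.Random(0))
context_idx = [i for i, m in enumerate(mask) if not m]  # frames seen by the context encoder
target_idx = [i for i, m in enumerate(mask) if m]       # frames whose latents are predicted
```

Because the mask is defined over frame tokens rather than samples, each masked frame hides a window of raw audio whose duration depends on the front end's total stride, which is one reason span length and ratio interact with the choice of feature extractor.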
The objective
For masked frame spans $k = 1, \dots, M$, the loss is the squared $\ell_2$ distance in latent space:
$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$
where $\operatorname{sg}$ is stop-gradient, $z_{\text{ctx}}$ is the context encoder's embedding of the unmasked frames, $m_k$ is the position-carrying mask token for span $k$, and $\bar f_\theta(x)_k$ is the target encoder's representation of span $k$. The target encoder is updated by EMA rather than by gradient descent. Crucially, both encoders include the learnable waveform front end, so the time-frequency abstraction is itself learned end-to-end rather than imposed by a spectrogram transform; the objective regresses the EMA latents of masked spans entirely in representation space.
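As a rough numerical illustration of the objective and the EMA update, the sketch below uses NumPy arrays in place of real encoder outputs. Several things here are assumptions for the example: each masked span is represented by a single latent vector, parameters are stored in a plain dict, and the function names and the `decay` value are hypothetical. In an autograd framework the target would additionally be detached to realise the stop-gradient; NumPy has no gradients, so the target is simply treated as a constant.

```python
import numpy as np


def latent_l2_loss(pred, target):
    """(1/M) * sum_k ||pred_k - target_k||_2^2 for arrays of shape (M, dim)."""
    diff = pred - target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def ema_update(target_params, context_params, decay=0.999):
    """In-place EMA: target <- decay * target + (1 - decay) * context."""
    for name, ctx in context_params.items():
        target_params[name] *= decay
        target_params[name] += (1.0 - decay) * ctx


# Usage with random stand-ins for predictor outputs and target latents.
rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 8))     # M = 4 masked spans, 8-dim latents
target = rng.normal(size=(4, 8))   # would come from the EMA target encoder
loss = latent_l2_loss(pred, target)

context_enc = {"w": np.ones((2, 2))}
target_enc = {"w": np.zeros((2, 2))}
ema_update(target_enc, context_enc, decay=0.9)  # target_enc["w"] is now about 0.1 everywhere
```

Only the context encoder and predictor would receive gradients from `loss`; the target encoder changes solely through `ema_update`, which is what keeps the regression targets slowly moving.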
Key results & what's novel
WavJEPA extends the JEPA family to the waveform domain, removing dependence on fixed spectral transforms by learning the front end inside the same latent-prediction objective. The authors argue that operating end-to-end on raw audio, combined with masking contiguous time spans, improves robustness to noise and channel variation relative to spectrogram pipelines, while latent masked prediction keeps the learning signal semantic. Evaluated across speech and general-audio tasks, it positions latent masked prediction on raw waveforms as a recipe for robust audio foundation models, complementing the spectrogram-based variants in the family.
Strengths & limitations
- + Learns its own time-frequency abstraction; no fixed spectral transform, no discarded phase.
- + Argued robustness to noise and channel variation from end-to-end waveform processing.
- + Decoder-free and augmentation-free, keeping the objective semantic.
- − Raw-waveform front ends are compute- and memory-heavy at high sample rates.
- − Span-masking design (length, ratio) is a sensitive lever, as in other audio JEPAs.
- − A representation learner, not generative; no dynamics or action conditioning.