Motivation
The standard JEPA predictor regresses a single expected target embedding for each masked region. But masked content is genuinely ambiguous: many plausible completions exist, and a deterministic prediction collapses this distribution to its mean. This can smooth over multimodal uncertainty, leaving representations less robust and less useful for generative purposes. Drawing on diffusion models, which learn to denoise corrupted signals across a range of noise levels and thereby capture full distributions, this work asks whether injecting diffusion-style noise into JEPA can make latent prediction probabilistic and better regularized.
How it works
The method keeps the familiar JEPA components and adds a diffusion process to the prediction step.
- A context encoder embeds the visible view.
- An EMA target encoder embeds the full image, supplying stop-gradient target embeddings at masked positions.
- A predictor operates in latent space, but instead of performing a single deterministic regression it receives a target embedding corrupted according to a diffusion-style noise schedule and learns to denoise it toward the clean target embedding under varying noise levels.
The latent loss is thus computed over a stochastic, multi-step denoising trajectory in embedding space, letting the model represent a distribution over masked-region features. The denoising objective is optimized jointly with the latent-prediction objective, making the overall loss a hybrid of the two.
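The corruption step can be sketched in a few lines. The summary does not say which noise schedule or corruption rule is used, so this sketch assumes the standard variance-preserving form from DDPM-style diffusion, $z^{(t)} = \sqrt{\bar\alpha_t}\,z + \sqrt{1-\bar\alpha_t}\,\epsilon$, with a linear beta schedule. The names `linear_alpha_bar` and `corrupt`, and all numeric defaults, are illustrative and not taken from the work.

```python
import math
import random

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal fraction alpha_bar[t] for an ASSUMED linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def corrupt(z_tgt, t, alpha_bar, rng):
    """Noise a target embedding at level t (ASSUMED variance-preserving rule)."""
    a = alpha_bar[t]
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for z in z_tgt]

rng = random.Random(0)
alpha_bar = linear_alpha_bar(num_steps=1000)

z_tgt = [0.5, -1.0, 2.0, 0.0]                  # toy target embedding
z_low = corrupt(z_tgt, 10, alpha_bar, rng)     # low noise: close to z_tgt
z_high = corrupt(z_tgt, 999, alpha_bar, rng)   # high noise: almost pure Gaussian
```

At small `t` the predictor sees an input that is nearly the answer, while at large `t` it must rely almost entirely on the context embedding. Training across that whole range is what exposes the predictor to varying noise levels.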
The objective
The predictor is trained to recover the target embedding from a noised version at level $t$ drawn from a diffusion schedule:
$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\,\big\lVert\, g_\phi\big(z_{\text{ctx}}, z_{\text{tgt}}^{(t)}, t\big) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$
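where $g_\phi$ is the predictor, $z_{\text{ctx}}$ is the context encoder's embedding of the visible view, $z_{\text{tgt}}^{(t)}$ is the target embedding corrupted with noise $\epsilon$ at noise level $t$, $z_{\text{tgt}} = \bar f_\theta(x)$ comes from the EMA encoder, whose parameters are updated from the context encoder's parameters $\theta$ as $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, and $\operatorname{sg}$ is stop-gradient. Averaging over noise levels turns the deterministic regression into a probabilistic denoising objective.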
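As a rough illustration of the objective and the EMA rule, the sketch below estimates the expectation over $t$ and $\epsilon$ by Monte Carlo sampling. It is a toy under stated assumptions, not the authors' implementation: the corruption rule is the variance-preserving form assumed from standard diffusion models, the three-level schedule is made up, and `ema_update`, `denoising_loss`, and the two stand-in predictors are hypothetical names. Plain Python has no autograd, so stop-gradient is represented simply by treating `z_tgt` as a constant.

```python
import math
import random

def ema_update(target_params, online_params, tau):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, elementwise."""
    return [tau * tb + (1.0 - tau) * th
            for tb, th in zip(target_params, online_params)]

def denoising_loss(predictor, z_ctx, z_tgt, alpha_bar, rng, num_samples=64):
    """Monte Carlo estimate of E_{t,eps} ||g(z_ctx, z_tgt^(t), t) - z_tgt||^2."""
    total = 0.0
    for _ in range(num_samples):
        t = rng.randrange(len(alpha_bar))          # sample a noise level
        a = alpha_bar[t]
        # ASSUMED corruption rule: sqrt(a) * z + sqrt(1 - a) * eps
        z_noisy = [math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
                   for z in z_tgt]
        pred = predictor(z_ctx, z_noisy, t)
        # z_tgt is a fixed target here (the role of stop-gradient)
        total += sum((p - z) ** 2 for p, z in zip(pred, z_tgt))
    return total / num_samples

def oracle(z_ctx, z_noisy, t):
    """Toy predictor that reads the answer straight from the context."""
    return z_ctx

def passthrough(z_ctx, z_noisy, t):
    """Toy predictor that does no denoising at all."""
    return z_noisy

rng = random.Random(0)
alpha_bar = [0.9, 0.5, 0.1]     # made-up three-level schedule
z_tgt = [0.5, -1.0, 2.0]
z_ctx = list(z_tgt)             # toy setup: context fully determines the target

loss_oracle = denoising_loss(oracle, z_ctx, z_tgt, alpha_bar, rng)
loss_passthrough = denoising_loss(passthrough, z_ctx, z_tgt, alpha_bar, rng)
new_target = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.9)
```

In this toy setup the oracle reaches zero loss while the pass-through predictor does not, which shows that the loss rewards removing the injected noise rather than copying the corrupted input. In a real system `predictor` would be a trained network, and the EMA update would be applied to the target encoder's parameters after each optimizer step.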
Key results & what's novel
By unifying diffusion's noise-conditioned denoising with JEPA's abstract latent-prediction principle, the work strengthens learned representations and connects to generative formulations such as D-JEPA. Adding diffusion noise turns JEPA's latent prediction into a probabilistic objective that captures the inherent ambiguity of masked content and regularizes the representation against collapse and over-smoothing. It exemplifies a 2025 trend of injecting stochasticity into joint-embedding prediction, improving both representation quality and the model's ability to reason about uncertain, multimodal targets.
Strengths & limitations
- + Captures distributional uncertainty over masked content rather than only its mean.
- + Noise acts as a regularizer against collapse and over-smoothing.
- + Bridges naturally to generative JEPA variants.
- − Multi-step denoising raises training cost over plain regression.
- − Adds noise-schedule hyperparameters to tune.
- − Primarily a representation regularizer; not a full generative model on its own.