At a glance
Problem: Standard JEPA predicts masked target embeddings deterministically, smoothing over the multimodal uncertainty of plausible content and limiting robustness.
Key idea: Inject diffusion-style noise into the JEPA objective so the predictor denoises toward target embeddings across noise levels rather than regressing a single point.
Modality: Images
Target / masking: Masked target embeddings (EMA-encoded) recovered via a multi-step noise-conditioned denoising trajectory in latent space.
Builds on: I-JEPA's masked latent prediction, regularized with diffusion noise.
Used for: Stronger, distribution-aware representations; connects to generative JEPA.

Motivation

The standard JEPA predictor regresses a single expected target embedding for each masked region. But masked content is genuinely ambiguous — many plausible completions exist — and a deterministic prediction collapses this distribution to its mean. This can smooth over multimodal uncertainty, leaving representations less robust and less useful for generative purposes. Drawing on diffusion models, which learn to denoise corrupted signals under a range of noise levels and thereby capture full distributions, this work asks whether injecting diffusion-style noise into JEPA can make latent prediction probabilistic and better regularized.

How it works

[Figure: JEPA schematic. Labels: image (patch-level, multi-block masking); context encoder f_θ producing z_ctx; target encoder f̄_θ (EMA copy) producing z̄ (stop-gradient); predictor g_φ; latent loss ‖ẑ − sg(z̄)‖²; local loss (e.g. MLM).]
Canonical JEPA schematic for images. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimises the distance between predicted and target embeddings in latent space. A local/generative loss runs alongside latent prediction (hybrid objective).

The method keeps the familiar JEPA components and adds a diffusion process to the prediction step.

  • A context encoder embeds the visible view.
  • An EMA target encoder embeds the full image, supplying stop-gradient target embeddings at masked positions.
  • A predictor operates in latent space, but its input is corrupted according to a diffusion-style noise schedule, so it learns to denoise toward the target embedding at varying noise levels rather than perform a single deterministic regression.
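
The noising step that the predictor must undo can be sketched as follows. This is a minimal illustration, not the method's actual schedule: the linear beta schedule, the variance-preserving corruption form, and the names `linear_alpha_bar` and `noise_target` are all assumptions borrowed from common diffusion practice, and the toy sizes are arbitrary.

```python
import numpy as np


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear beta schedule (assumed, DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def noise_target(z_tgt, t, alpha_bar, rng):
    """Corrupt target embeddings at noise level t.

    Assumed form: z_t = sqrt(alpha_bar[t]) * z + sqrt(1 - alpha_bar[t]) * eps.
    Returns the noised embedding and the Gaussian noise that was used.
    """
    eps = rng.standard_normal(z_tgt.shape)
    a = alpha_bar[t]
    z_t = np.sqrt(a) * z_tgt + np.sqrt(1.0 - a) * eps
    return z_t, eps


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
# Stand-in for EMA-encoded targets: 4 masked patches, 16-dim embeddings (toy sizes).
z_tgt = rng.standard_normal((4, 16))
z_low, _ = noise_target(z_tgt, 0, alpha_bar, rng)     # almost clean
z_high, _ = noise_target(z_tgt, 999, alpha_bar, rng)  # almost pure noise
```

At low `t` the predictor sees a nearly clean target and the task is easy; at high `t` almost no target signal survives, so the prediction has to rely on the context embedding.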

The latent loss is thus computed over a stochastic, multi-step denoising trajectory in embedding space, letting the model represent a distribution over masked-region features. The denoising objective runs together with the latent-prediction objective, making the overall loss hybrid.
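
One way to picture a multi-step denoising trajectory in embedding space is the loop below. The source does not specify a sampler, so this is only a sketch under stated assumptions: it uses a deterministic DDIM-style update, assumes the predictor outputs an estimate of the clean target embedding, and `denoise_trajectory` and `predict_fn` are hypothetical names. The oracle predictor is a stand-in so the example runs without a trained model.

```python
import numpy as np


def denoise_trajectory(predict_fn, z_ctx, alpha_bar, shape, rng):
    """Deterministic multi-step denoising in embedding space (assumed DDIM-style).

    predict_fn(z_ctx, z_t, t) is a hypothetical predictor that returns an
    estimate of the clean target embedding given the noised one.
    """
    z_t = rng.standard_normal(shape)  # start from pure Gaussian noise
    z0_hat = z_t
    for t in reversed(range(len(alpha_bar))):
        z0_hat = predict_fn(z_ctx, z_t, t)
        if t > 0:
            # Noise implied by the current clean estimate at level t ...
            eps_hat = (z_t - np.sqrt(alpha_bar[t]) * z0_hat) / np.sqrt(1.0 - alpha_bar[t])
            # ... reused to move to the next, lower noise level t - 1.
            z_t = np.sqrt(alpha_bar[t - 1]) * z0_hat + np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat
    return z0_hat


rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.2, 50))  # short toy schedule
z_true = rng.standard_normal((4, 16))  # pretend target embeddings


def oracle(z_ctx, z_t, t):
    # Stand-in for a trained predictor: always returns the true embedding.
    return z_true


z_hat = denoise_trajectory(oracle, None, alpha_bar, z_true.shape, rng)
```

With a trained predictor in place of the oracle, different starting noise can lead to different final embeddings, which is how a trajectory like this can express more than one plausible completion.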

The objective

The predictor is trained to recover the target embedding from a noised version at level $t$ drawn from a diffusion schedule:

$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\,\big\lVert\, g_\phi\big(z_{\text{ctx}}, z_{\text{tgt}}^{(t)}, t\big) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$

where $z_{\text{tgt}}^{(t)}$ is the target embedding corrupted with noise $\epsilon$ at level $t$, $z_{\text{tgt}} = \bar f_\theta(x)$ comes from the EMA encoder, whose parameters $\bar\theta$ are updated as $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, and $\operatorname{sg}$ is stop-gradient. Taking the expectation over noise levels and noise samples turns the deterministic regression into a probabilistic denoising objective.
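
The objective and the EMA update can be sketched numerically as below. This is a toy Monte Carlo estimate, not the authors' training code: the corruption form, the linear schedule, the default `tau`, and the names `denoising_loss`, `ema_update`, and `predict_fn` are assumptions made for illustration. NumPy has no autodiff, so stop-gradient is represented simply by treating `z_tgt` as a constant.

```python
import numpy as np


def ema_update(target_params, online_params, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, applied per parameter."""
    return {k: tau * target_params[k] + (1.0 - tau) * online_params[k]
            for k in target_params}


def denoising_loss(predict_fn, z_ctx, z_tgt, alpha_bar, rng, num_samples=8):
    """Monte Carlo estimate of E_{t,eps} ||g(z_ctx, z_tgt^(t), t) - sg(z_tgt)||^2."""
    losses = []
    for _ in range(num_samples):
        t = int(rng.integers(len(alpha_bar)))   # sample a noise level
        eps = rng.standard_normal(z_tgt.shape)  # sample Gaussian noise
        # Assumed variance-preserving corruption of the target embedding.
        z_noisy = np.sqrt(alpha_bar[t]) * z_tgt + np.sqrt(1.0 - alpha_bar[t]) * eps
        pred = predict_fn(z_ctx, z_noisy, t)
        # Squared L2 distance per masked patch, averaged over patches.
        losses.append(np.sum((pred - z_tgt) ** 2, axis=-1).mean())
    return float(np.mean(losses))


rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
z_ctx = rng.standard_normal((4, 16))  # toy context embeddings
z_tgt = rng.standard_normal((4, 16))  # toy EMA target embeddings

# A perfect predictor has zero loss; one that just echoes its noisy input does not.
oracle_loss = denoising_loss(lambda c, z, t: z_tgt, z_ctx, z_tgt, alpha_bar, rng)
identity_loss = denoising_loss(lambda c, z, t: z, z_ctx, z_tgt, alpha_bar, rng)

online = {"w": np.ones(3)}
target = ema_update({"w": np.zeros(3)}, online, tau=0.9)
```

In a real training loop the loss would be backpropagated through the predictor and context encoder only, and `ema_update` would run after each optimizer step to refresh the target encoder.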

Key results & what's novel

By unifying diffusion's noise-conditioned denoising with JEPA's abstract latent-prediction principle, the work strengthens learned representations and connects to generative formulations such as D-JEPA. Adding diffusion noise turns JEPA's latent prediction into a probabilistic objective that captures the inherent ambiguity of masked content and regularizes the representation against collapse and over-smoothing. It exemplifies a 2025 trend of injecting stochasticity into joint-embedding prediction, improving both representation quality and the model's ability to reason about uncertain, multimodal targets.

Strengths & limitations

  • + Captures distributional uncertainty over masked content instead of its mean.
  • + Noise acts as a regularizer against collapse and over-smoothing.
  • + Bridges naturally to generative JEPA variants.
  • − Multi-step denoising raises training cost over plain regression.
  • − Adds noise-schedule hyperparameters to tune.
  • − Primarily a representation regularizer; not a full generative model on its own.

Connections & references

Builds on: I-JEPA