At a glance
Problem: JEPA learns strong representations but is not generative; diffusion and autoregressive generators work in pixel/token space and lack JEPA's abstract feature guidance.
Key idea: Reframe generation as denoising in feature space: turn the JEPA predictor into a generative denoiser over latent targets so it can sample, not just regress means.
Modality: Images
Target / masking: Masked target tokens (EMA-encoded) predicted via a denoising / next-token diffusion process rather than single-point regression.
Builds on: I-JEPA's masked latent prediction, made generative via denoising.
Used for: Unified high-quality image synthesis and discriminative representation learning.

Motivation

JEPA produces excellent representations but cannot generate: its predictor regresses an expected target embedding, washing out the multimodal detail needed for synthesis. Diffusion and autoregressive models generate well but operate in pixel or token space and lack the abstract, semantic guidance that JEPA's latent targets provide. Unifying representation learning with generation — a single model that both perceives and synthesizes — is an open goal. D-JEPA pursues it by making the JEPA objective itself generative, so the same latent-prediction backbone can sample images while retaining JEPA's representational benefits.
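The mean-regression point can be checked on a toy case that is not taken from the D-JEPA work. The sketch below assumes a one-dimensional "target embedding" with two equally likely modes; the mode locations, noise scale, and the names `targets` and `mse` are illustrative assumptions.

```python
import random
import statistics

random.seed(0)

# Hypothetical 1-D target with two equally likely modes near -1 and +1.
targets = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05)
           for _ in range(10_000)]


def mse(prediction, samples):
    """Mean squared error of one point prediction against all samples."""
    return sum((prediction - s) ** 2 for s in samples) / len(samples)


# The constant that minimises squared error is the sample mean.
best_point = statistics.fmean(targets)

print(f"MSE-optimal prediction: {best_point:+.3f}")
print(f"MSE at the mean:        {mse(best_point, targets):.3f}")
print(f"MSE at the +1 mode:     {mse(1.0, targets):.3f}")
```

The squared-error optimum lands near 0, between the two modes, where almost no actual target lies. A predictor that can sample would instead return values near one mode or the other.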

How it works

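[Figure: JEPA schematic. An image (patches, multi-block masking) feeds a context encoder f_θ and a target encoder f̄_θ (EMA copy). The predictor g_φ maps z_ctx to ẑ, trained with the latent loss ‖ẑ − sg(z̄)‖², alongside a local loss (e.g. MLM).]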
Canonical JEPA schematic for images. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context to the target embeddings; training minimises the latent distance. A local/generative loss runs alongside latent prediction, which makes the objective hybrid.

D-JEPA reframes generation as denoising in feature space inside the JEPA framework.

  • A context encoder embeds visible tokens.
  • A predictor forecasts the representations of masked target tokens, which are encoded by an EMA target encoder.
  • Instead of a single regression target, the prediction is trained as a denoising / next-token diffusion process, so the predictor learns a distribution over target features and can sample them.

This blends masked autoregressive generation with diffusion-style denoising, all guided by JEPA latent targets. Because the model represents distributions rather than means, it can synthesize high-quality images while the same backbone yields discriminative features. The denoising/generative loss runs jointly with latent prediction, making the objective hybrid.
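The two pieces of bookkeeping this pipeline relies on, splitting tokens into context and targets and keeping the target encoder as an EMA copy, can be sketched as follows. This is a minimal illustration rather than the D-JEPA implementation: the uniform random masking (instead of multi-block masking), the momentum value, and the helper names `split_context_targets` and `ema_update` are all assumptions.

```python
import random


def split_context_targets(num_tokens, mask_ratio, rng):
    """Pick hidden target positions uniformly at random (assumed scheme)."""
    positions = list(range(num_tokens))
    rng.shuffle(positions)
    num_masked = max(1, int(round(mask_ratio * num_tokens)))
    targets = sorted(positions[:num_masked])
    context = sorted(positions[num_masked:])
    return context, targets


def ema_update(target_params, online_params, momentum=0.99):
    """Move each target-encoder parameter a small step toward the online one."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


rng = random.Random(0)
context_idx, target_idx = split_context_targets(16, mask_ratio=0.75, rng=rng)

# Toy flat parameter vectors standing in for encoder weights.
online_params = [1.0, 2.0, 3.0]
target_params = ema_update([0.0, 0.0, 0.0], online_params, momentum=0.9)
```

In a real model the context encoder would see only `context_idx`, the EMA encoder would embed the tokens at `target_idx`, and `ema_update` would run after every optimizer step with no gradient flowing into the target parameters.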

The objective

For masked targets, the predictor is trained as a denoiser that recovers the target embedding from a noised version under noise level $t$:

$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\,\big\lVert\, g_\phi\big(z_{\text{ctx}}, z_{\text{tgt}}^{(t)}, t\big) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$

where $z_{\text{ctx}}$ is the context encoder's output, $z_{\text{tgt}}^{(t)}$ is the target embedding corrupted with noise $\epsilon$ at level $t$, $z_{\text{tgt}} = \bar f_\theta(x)$ comes from the EMA encoder, and $\operatorname{sg}$ is stop-gradient. Sampling proceeds autoregressively across masked positions, iteratively denoising the target features at each one, which turns latent prediction into a generative process.
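The loss and the sampling loop can be sketched in plain Python. Several choices here are assumptions, since the text above does not fix them: the corruption rule $z^{(t)} = \sqrt{1-t}\,z + \sqrt{t}\,\epsilon$ with $t \in [0, 1]$, the uniform sampling of $t$, the re-noising sampler, and the names `corrupt`, `denoising_loss`, and `sample_targets`. The `predictor` argument is a hypothetical callable standing in for $g_\phi$.

```python
import math
import random


def corrupt(z_tgt, t, rng):
    """Noise a target embedding at level t in [0, 1] (assumed schedule)."""
    return [math.sqrt(1.0 - t) * z + math.sqrt(t) * rng.gauss(0.0, 1.0)
            for z in z_tgt]


def denoising_loss(predictor, z_ctx, z_tgt, rng, num_draws=64):
    """Monte Carlo estimate of E_{t,eps} ||g(z_ctx, z_t, t) - sg(z_tgt)||^2."""
    total = 0.0
    for _ in range(num_draws):
        t = rng.random()
        z_hat = predictor(z_ctx, corrupt(z_tgt, t, rng), t)
        # z_tgt is a fixed constant here, which plays the role of stop-gradient.
        total += sum((a - b) ** 2 for a, b in zip(z_hat, z_tgt))
    return total / num_draws


def sample_targets(predictor, z_ctx, num_targets, dim, rng, num_steps=10):
    """Fill masked positions one at a time, denoising each from pure noise."""
    generated = []
    for _ in range(num_targets):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start at t = 1
        for step in range(num_steps, 0, -1):
            t, t_next = step / num_steps, (step - 1) / num_steps
            # Condition on the context plus every position generated so far.
            z0_hat = predictor(z_ctx + generated, z, t)
            # Re-noise the clean estimate to the next, lower noise level.
            z = [math.sqrt(1.0 - t_next) * c
                 + math.sqrt(t_next) * rng.gauss(0.0, 1.0) for c in z0_hat]
        generated.append(z)
    return generated


rng = random.Random(0)
z_tgt = [0.5, -1.0, 2.0]


def oracle(z_ctx, z_t, t):
    """Toy predictor that always returns the clean target."""
    return list(z_tgt)


def identity(z_ctx, z_t, t):
    """Toy predictor that returns its noisy input unchanged."""
    return z_t


oracle_loss = denoising_loss(oracle, [], z_tgt, rng)
identity_loss = denoising_loss(identity, [], z_tgt, rng)
samples = sample_targets(oracle, [], num_targets=2, dim=3, rng=rng, num_steps=4)
```

The oracle predictor drives the loss to zero, while the identity predictor does not. In the sampler, each new position is conditioned on the context plus all positions generated so far, which is the autoregressive part; the inner loop is the diffusion-style denoising part.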

Key results & what's novel

D-JEPA demonstrates that JEPA's latent-prediction backbone can power strong image generation while retaining the abstraction benefits of embedding-space learning. Its novelty is turning the predictor into a generative denoiser over latent targets, so joint-embedding prediction becomes a generative objective rather than only a representation-learning one. By marrying masked autoregressive generation with diffusion-style denoising in feature space, it bridges the historically separate worlds of self-supervised JEPA representation learning and generative modeling, pointing toward unified perceive-and-generate models.

Strengths & limitations

  • + Unifies representation learning and image synthesis in one model.
  • + Samples distributions, overcoming JEPA's mean-regression limitation.
  • + Generation is guided by abstract latent targets, not raw pixels.
  • - Iterative denoising makes sampling more expensive than a single forward pass.
  • - Combining diffusion and EMA-target machinery adds training complexity.
  • - Balancing the generative and predictive terms requires careful tuning.

Connections & references

Builds on: I-JEPA