Motivation
JEPA produces excellent representations but cannot generate: its predictor regresses an expected target embedding, washing out the multimodal detail needed for synthesis. Diffusion and autoregressive models generate well but operate in pixel or token space and lack the abstract, semantic guidance that JEPA's latent targets provide. Unifying representation learning with generation in a single model that both perceives and synthesizes remains an open goal. D-JEPA pursues it by making the JEPA objective itself generative, so the same latent-prediction backbone can sample images while retaining JEPA's representational benefits.
How it works
D-JEPA reframes generation as denoising in feature space inside the JEPA framework.
- A context encoder embeds visible tokens.
- A predictor forecasts the representations of masked target tokens; the reference representations it is trained against come from an EMA (exponential moving average) target encoder.
- Instead of a single regression target, the prediction is trained as a denoising / next-token diffusion process, so the predictor learns a distribution over target features and can sample them.
This blends masked autoregressive generation with diffusion-style denoising, all guided by JEPA latent targets. Because the model represents distributions rather than means, it can synthesize high-quality images while the same backbone yields discriminative features. The denoising/generative loss runs jointly with latent prediction, making the objective hybrid.
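Two pieces of the pipeline above can be illustrated in plain Python: splitting token positions into visible context and masked targets, and the EMA update that keeps the target encoder a slow-moving copy of the context encoder. This is a minimal sketch, not D-JEPA's actual code. The function names, the `mask_ratio` argument, and the `momentum` default are assumptions chosen for illustration, and encoder weights are stood in for by flat lists of floats.

```python
import random


def split_context_targets(num_tokens, mask_ratio, rng):
    """Randomly partition token positions into visible context and masked targets.

    Hypothetical helper: the source does not specify the masking strategy.
    """
    positions = list(range(num_tokens))
    rng.shuffle(positions)
    n_masked = int(round(mask_ratio * num_tokens))
    targets = sorted(positions[:n_masked])
    context = sorted(positions[n_masked:])
    return context, targets


def ema_update(target_params, online_params, momentum=0.996):
    """Move each target-encoder weight a small step toward the online weight.

    The momentum value is an assumed default, not taken from the source.
    """
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    context, targets = split_context_targets(num_tokens=16, mask_ratio=0.75, rng=rng)
    print("context positions:", context)
    print("target positions:", targets)
    print("updated target weights:", ema_update([1.0, 2.0], [0.0, 0.0], momentum=0.9))
```

Because the target encoder is updated only through this average and never by gradients, its embeddings act as slowly changing targets for the predictor.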
The objective
For masked targets, the predictor is trained as a denoiser that recovers the clean target embedding from a version corrupted at noise level $t$:
$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\,\big\lVert\, g_\phi\big(z_{\text{ctx}}, z_{\text{tgt}}^{(t)}, t\big) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$
where $g_\phi$ is the predictor, $z_{\text{ctx}}$ is the context encoder's output for the visible tokens, $z_{\text{tgt}}^{(t)}$ is the target embedding corrupted with noise $\epsilon$ at level $t$, $z_{\text{tgt}} = \bar f_\theta(x)$ comes from the EMA encoder, and $\operatorname{sg}$ is stop-gradient. Sampling denoises target features iteratively, proceeding autoregressively across masked positions, which turns latent prediction into a generative process.
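The objective and the sampling loop can be sketched in plain Python, under stated assumptions. The source does not give the corruption rule or the sampler, so the variance-preserving mix in `noise_target`, the uniform draw of $t$, and the predict-then-re-noise loop in `sample_targets` are illustrative choices, and all function names are hypothetical. Embeddings are plain lists of floats and `predictor` is any callable `(z_ctx, z_noisy, t) -> list`. With no autograd involved, stop-gradient simply means the target is treated as a constant.

```python
import math
import random


def noise_target(z_tgt, t, rng):
    """Corrupt an embedding at noise level t in [0, 1].

    Assumed rule: sqrt(1 - t) * z + sqrt(t) * eps with eps ~ N(0, I).
    """
    keep, spread = math.sqrt(1.0 - t), math.sqrt(t)
    return [keep * z + spread * rng.gauss(0.0, 1.0) for z in z_tgt]


def denoising_loss(predictor, z_ctx, z_tgt, rng, n_samples=64):
    """Monte Carlo estimate of E_{t,eps} ||g(z_ctx, z_tgt^(t), t) - sg(z_tgt)||^2."""
    total = 0.0
    for _ in range(n_samples):
        t = rng.random()  # assumed: t drawn uniformly from [0, 1)
        z_noisy = noise_target(z_tgt, t, rng)
        pred = predictor(z_ctx, z_noisy, t)
        # z_tgt is a fixed constant here, which plays the role of stop-gradient.
        total += sum((p - z) ** 2 for p, z in zip(pred, z_tgt))
    return total / n_samples


def sample_targets(predictor, z_ctx, n_targets, dim, rng, n_steps=10):
    """Generate target features one masked position at a time.

    Each position starts from pure noise, is denoised over n_steps levels,
    and is then appended to the context used for the next position.
    """
    context = list(z_ctx)
    samples = []
    for _ in range(n_targets):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # noise level t = 1
        for step in range(n_steps, 0, -1):
            t, t_next = step / n_steps, (step - 1) / n_steps
            z_clean = predictor(context, z, t)       # predicted clean embedding
            z = noise_target(z_clean, t_next, rng)   # re-noise to the lower level
        samples.append(z)
        context.append(z)
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    z_tgt = [0.5, -1.0, 2.0]
    identity = lambda ctx, z, t: z  # toy predictor that does no denoising
    print("loss of identity predictor:", denoising_loss(identity, [], z_tgt, rng))
    print("samples:", sample_targets(identity, [[0.0] * 3], 2, 3, rng, n_steps=4))
```

A predictor that always returns the clean target drives this loss to zero, while one that passes the noisy input through unchanged does not. The final denoising step uses noise level 0, so each returned sample is the predictor's last clean estimate.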
Key results & what's novel
D-JEPA demonstrates that JEPA's latent-prediction backbone can power strong image generation while retaining the abstraction benefits of embedding-space learning. Its novelty is turning the predictor into a generative denoiser over latent targets, so joint-embedding prediction becomes a generative objective rather than only a representation-learning one. By marrying masked autoregressive generation with diffusion-style denoising in feature space, it bridges the historically separate worlds of self-supervised JEPA representation learning and generative modeling, pointing toward unified perceive-and-generate models.
Strengths & limitations
- + Unifies representation learning and image synthesis in one model.
- + Samples distributions, overcoming JEPA's mean-regression limitation.
- + Generation is guided by abstract latent targets, not raw pixels.
- − Iterative denoising makes sampling more expensive than a single forward pass.
- − Combining diffusion and EMA-target machinery adds training complexity.
- − Balancing the generative and predictive terms requires careful tuning.