Motivation
Generative JEPA models such as D-JEPA can synthesize images by predicting and sampling visual latents, but they operate purely on visual representations and have no mechanism for text conditioning. They therefore cannot perform text-to-image generation, despite having a strong latent-prediction backbone. Controllable generation, meaning the production of an image that matches a written prompt, requires injecting language information into the generative process. JEPA-T addresses this gap by adding text conditioning directly into the joint-embedding predictive objective, rather than relying on a separate pixel- or diffusion-token generator.
How it works
JEPA-T adds text fusion to a generative JEPA.
- Visual tokens follow the usual JEPA path: a context encoder embeds the visible tokens, an EMA target encoder produces the target embeddings, and a predictor forecasts the masked target embeddings in latent space.
- A text encoder embeds the caption.
- A fusion mechanism (cross-attention or feature fusion between text and visual tokens) injects the caption information into the predictor, so the latent prediction is conditioned on the textual prompt.
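The fusion step can be illustrated with a minimal single-head cross-attention sketch in plain Python. This is an illustration only, not JEPA-T's actual fusion module: the function name `cross_attention_fuse`, the identity projections (no learned weight matrices), and the residual addition are all assumptions made to keep the example short. A real implementation would use a tensor library with learned projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attention_fuse(visual_tokens, text_tokens):
    """Fuse text into visual tokens with single-head cross-attention.

    Queries are visual tokens; keys and values are text tokens.
    Learned projections are omitted (identity) for brevity (assumption).
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in visual_tokens:
        # Attention weights of this visual query over all text tokens.
        weights = softmax([dot(q, k) * scale for k in text_tokens])
        # Weighted sum of text values for this visual query.
        attended = [
            sum(w * v[d] for w, v in zip(weights, text_tokens))
            for d in range(dim)
        ]
        # Residual connection keeps the original visual content.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused

visual = [[1.0, 0.0], [0.0, 1.0]]  # two toy visual tokens
text = [[1.0, 1.0]]                # one toy caption token
fused = cross_attention_fuse(visual, text)
```

With a single caption token, every visual token attends to it with weight 1, so each fused token is simply the visual token plus the text vector. With several caption tokens, each visual token would instead receive its own mixture of text features.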
Generation proceeds by predicting and sampling visual latents that are consistent with both the visible context and the text, so text alignment is built directly into the embedding-space objective. As in the generative-JEPA line, latents are sampled by denoising masked target features, so the training loss is hybrid: a generative denoising term combined with latent prediction.
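The sampling procedure described above can be sketched as an iterative denoising loop. The summary does not specify the noise schedule or sampler, so everything here is an assumption: the linear interpolation between clean latent and noise, the number of steps, and the hypothetical names `toy_predictor` and `sample_latent`. The toy predictor merely stands in for a trained text-conditioned predictor.

```python
import random

def toy_predictor(z_ctx, z_noisy, c, t):
    """Hypothetical stand-in for a trained text-conditioned predictor.

    Returns a clean-latent estimate; here just the average of the
    context embedding and the fused text feature, ignoring z_noisy and t.
    """
    return [(a + b) / 2.0 for a, b in zip(z_ctx, c)]

def sample_latent(z_ctx, c, predictor, num_steps=4, seed=0):
    """Iteratively denoise a masked target latent, starting from pure noise."""
    rng = random.Random(seed)
    dim = len(z_ctx)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start at t = 1 (pure noise)
    for step in range(num_steps):
        t = 1.0 - step / num_steps             # current noise level
        t_next = 1.0 - (step + 1) / num_steps  # next, lower noise level
        z_clean = predictor(z_ctx, z, c, t)    # estimate of the clean latent
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Re-noise the estimate to the next level (assumed linear schedule).
        z = [(1.0 - t_next) * zc + t_next * n for zc, n in zip(z_clean, noise)]
    return z

z_ctx = [0.2, -0.4, 1.0]  # toy context embedding
c = [1.0, 0.0, -1.0]      # toy fused text feature
latent = sample_latent(z_ctx, c, toy_predictor)
```

Because the final step re-noises to level 0, the returned latent equals the predictor's last clean estimate. The caption enters only through `c`, which is the sense in which the sampled latent is tied to the prompt.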
The objective
The text-conditioned predictor denoises toward masked visual target embeddings, conditioned on fused text features $c$:
$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\,\big\lVert\, g_\phi\big(z_{\text{ctx}}, z_{\text{tgt}}^{(t)}, c, t\big) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$
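where $g_\phi$ is the text-conditioned predictor, $z_{\text{ctx}}$ is the context-encoder embedding of the visible tokens, $c$ is the fused caption representation, $z_{\text{tgt}}^{(t)}$ is the target embedding corrupted with noise $\epsilon$ at noise level $t$, $z_{\text{tgt}} = f_{\bar\theta}(x)$ is the output of the EMA target encoder (whose parameters track the online encoder parameters $\theta$ via $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$ with momentum $\tau$), and $\operatorname{sg}$ is stop-gradient, so no gradient flows into the target. Conditioning on $c$ ties the sampled latents to the prompt.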
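A minimal sketch of one loss evaluation and one EMA update, in plain Python, may help make the formula concrete. The corruption rule (linear interpolation toward Gaussian noise), the momentum value, and the names `corrupt`, `denoising_loss`, `ema_update`, and `identity_predictor` are assumptions for illustration. The summary specifies only the squared-error form of the loss and the EMA rule.

```python
import random

def ema_update(target_params, online_params, tau=0.99):
    """EMA target update: theta_bar <- tau * theta_bar + (1 - tau) * theta."""
    return [tau * tb + (1.0 - tau) * th
            for tb, th in zip(target_params, online_params)]

def corrupt(z_tgt, t, rng):
    """Corrupt the clean target at noise level t in [0, 1] (assumed linear schedule)."""
    return [(1.0 - t) * z + t * rng.gauss(0.0, 1.0) for z in z_tgt]

def denoising_loss(predictor, z_ctx, z_tgt, c, t, rng):
    """Squared L2 distance between the prediction and the fixed target."""
    z_noisy = corrupt(z_tgt, t, rng)
    pred = predictor(z_ctx, z_noisy, c, t)
    # z_tgt is treated as a constant here, which plays the role of stop-gradient.
    return sum((p - z) ** 2 for p, z in zip(pred, z_tgt))

def identity_predictor(z_ctx, z_noisy, c, t):
    """Hypothetical placeholder for the predictor: returns its noisy input unchanged."""
    return z_noisy

rng = random.Random(0)
z_tgt = [0.5, -1.0, 2.0]   # toy EMA target embedding
z_ctx = [0.0, 0.0, 0.0]    # toy context embedding
c = [0.0, 0.0, 0.0]        # toy fused text feature

loss_clean = denoising_loss(identity_predictor, z_ctx, z_tgt, c, 0.0, rng)
loss_noisy = denoising_loss(identity_predictor, z_ctx, z_tgt, c, 1.0, rng)
new_target = ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9)
```

At noise level 0 the identity predictor already matches the target, so the loss is zero. At higher noise levels a trained predictor would have to use the context and caption to recover the target. The expectation over $t$ and $\epsilon$ in the objective corresponds to averaging this loss over randomly drawn noise levels and noise samples.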
Key results & what's novel
JEPA-T extends the generative-JEPA line (e.g., D-JEPA) toward multimodal, prompt-driven generation. Its novelty is conditioning the JEPA predictor on fused text features, which makes joint-embedding prediction controllable and enables text-to-image synthesis within the JEPA paradigm rather than in pixel or diffusion-token space alone. It shows that latent-prediction generators can incorporate language guidance through fusion, positioning text-conditioned joint-embedding prediction as a viable route to controllable image generation that retains JEPA's abstract, reconstruction-free representation learning.
Strengths & limitations
- + Enables text-to-image generation within the JEPA framework.
- + Text conditioning built into the latent objective via fusion, not bolted on downstream.
- + Retains JEPA's abstract, reconstruction-free representation learning.
- − Requires paired image-caption data for training.
- − Inherits the generative-JEPA cost of iterative latent sampling.
- − Fusion design and the balance of conditioning strength need tuning.