At a glance
Problem: Generative JEPA models operate purely on visual latents and lack text conditioning, so they cannot perform text-to-image generation.
Key idea: Fuse caption features into the JEPA predictor so latent prediction is conditioned on text, making joint-embedding generation controllable.
Modality: Text + image
Target / masking: Masked visual target tokens (EMA-encoded) predicted from visible context fused with text features.
Builds on: Generative JEPA (D-JEPA) plus a text-fusion conditioning mechanism.
Used for: Text-conditioned (prompt-driven) image generation.

Motivation

Generative JEPA models such as D-JEPA can synthesize images by predicting and sampling visual latents, but they operate purely on visual representations with no mechanism for text conditioning. They therefore cannot perform text-to-image generation despite having a strong latent-prediction backbone. Controllable generation — producing an image that matches a written prompt — requires injecting language information into the generative process. JEPA-T addresses this gap by adding text conditioning directly into the joint-embedding predictive objective, rather than relying on a separate pixel- or diffusion-token generator.

How it works

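[Figure: JEPA schematic. Input: text + image (patches, multi-block masking). Blocks: context encoder $f_\theta$, target encoder $\bar f_\theta$ (EMA copy), predictor $g_\phi$. Signals: $z_{\text{ctx}}$, $\bar z$ (stop-gradient). Losses: latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$ and a local loss (e.g. MLM).]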
Canonical JEPA schematic for Text + image. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance. A local/generative loss runs alongside latent prediction (hybrid objective).

JEPA-T adds text fusion to a generative JEPA.

  • Visual tokens pass through the usual context encoder, EMA target encoder, and a predictor that forecasts masked target embeddings in latent space.
  • A text encoder embeds the caption.
  • A fusion mechanism (cross-attention / feature fusion of text and visual tokens) injects the caption information into the predictor, so the latent prediction is conditioned on the textual prompt.
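The fusion step in the list above can be sketched as a single cross-attention layer in which visual tokens query the caption tokens. This is a minimal NumPy sketch under stated assumptions, not the JEPA-T implementation: the function name `cross_attention_fuse`, the single-head design, the residual connection, and all shapes are hypothetical choices made for illustration, and the source does not specify the exact fusion architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(visual, text, w_q, w_k, w_v):
    """Inject caption features into visual tokens (hypothetical single-head layer).

    visual: (n_vis, d) visual token embeddings (queries)
    text:   (n_txt, d) caption token embeddings (keys and values)
    Returns the text-conditioned visual tokens (n_vis, d) and the
    attention weights (n_vis, n_txt).
    """
    q = visual @ w_q
    k = text @ w_k
    v = text @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each visual token attends over the caption
    fused = visual + attn @ v                # residual: keep visual content, add text
    return fused, attn

# Toy inputs: 9 visual tokens and 5 caption tokens of width 16 (assumed sizes).
rng = np.random.default_rng(0)
d = 16
visual = rng.normal(size=(9, d))
text = rng.normal(size=(5, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = cross_attention_fuse(visual, text, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

The fused tokens keep the shape of the visual tokens, so a predictor that already consumes visual tokens could accept them without changing its interface.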

Generation proceeds by predicting and sampling visual latents that are consistent with both the visible context and the text, so text alignment is built directly into the embedding-space objective. As in the generative-JEPA line, sampling is done via denoising over masked target features, making the loss hybrid (generative plus latent prediction).
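To make the idea of sampling by denoising over masked target features concrete, here is a toy iterative sampler. It is a sketch under stated assumptions rather than the procedure used by JEPA-T or D-JEPA: the linear corruption $z^{(t)} = (1-t)\,z + t\,\epsilon$, the deterministic update rule, the step count, and the names `sample_masked_latents` and `oracle_predictor` are all hypothetical. The predictor is assumed to return an estimate of the clean target latents.

```python
import numpy as np

def sample_masked_latents(predictor, z_ctx, c, n_tgt, dim, steps=10, seed=0):
    """Iteratively denoise masked target latents, conditioned on context and text.

    Assumes the corruption z_t = (1 - t) * z0 + t * eps, with t = 1 pure noise
    and t = 0 clean, and a predictor that outputs an estimate of z0.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_tgt, dim))         # start from pure noise (t = 1)
    ts = np.linspace(1.0, 0.0, steps + 1)     # decreasing noise levels
    for t, t_next in zip(ts[:-1], ts[1:]):    # t is always > 0 inside the loop
        z0_hat = predictor(z_ctx, z, c, t)    # predicted clean target latents
        eps_hat = (z - (1.0 - t) * z0_hat) / t            # noise implied by the prediction
        z = (1.0 - t_next) * z0_hat + t_next * eps_hat    # move to the next noise level
    return z

def oracle_predictor(z_ctx, z_noisy, c, t):
    """Hypothetical stand-in for g_phi: always predicts the mean caption feature."""
    return np.tile(c.mean(axis=0), (z_noisy.shape[0], 1))

rng = np.random.default_rng(1)
z_ctx = rng.normal(size=(6, 8))       # visible context embeddings
prompt_a = rng.normal(size=(5, 8))    # fused text features for one caption
prompt_b = rng.normal(size=(5, 8))    # fused text features for another caption

latents_a = sample_masked_latents(oracle_predictor, z_ctx, prompt_a, n_tgt=4, dim=8)
latents_b = sample_masked_latents(oracle_predictor, z_ctx, prompt_b, n_tgt=4, dim=8)
print(latents_a.shape)
```

With the same starting noise, the two prompts lead to different latents, which is the sense in which conditioning steers generation. The loop also shows why sampling is iterative: every step costs one predictor call.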

The objective

The text-conditioned predictor denoises toward masked visual target embeddings, conditioned on fused text features $c$:

$$\mathcal{L} = \mathbb{E}_{t,\epsilon}\,\big\lVert\, g_\phi\big(z_{\text{ctx}}, z_{\text{tgt}}^{(t)}, c, t\big) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$

where $c$ is the fused caption representation, $z_{\text{tgt}}^{(t)}$ is the target embedding corrupted at noise level $t$, $z_{\text{tgt}} = \bar f_\theta(x)$ is the EMA target (with $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$), and $\operatorname{sg}$ is stop-gradient. Conditioning on $c$ ties the sampled latents to the prompt.
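The objective and the EMA update can be written out as a short NumPy sketch. Several pieces are assumptions and not taken from the source: the linear corruption in `add_noise`, the single-sample estimate of the expectation over $t$ and $\epsilon$, the parameter dictionaries, and the toy predictor are hypothetical. NumPy has no autograd, so the stop-gradient is only noted in a comment; in a framework with automatic differentiation the target would be detached explicitly.

```python
import numpy as np

def add_noise(z_tgt, t, eps):
    """Hypothetical corruption: t = 0 leaves the target clean, t = 1 is pure noise."""
    return (1.0 - t) * z_tgt + t * eps

def denoising_latent_loss(predictor, z_ctx, z_tgt, c, t, eps):
    """Single-sample estimate of || g(z_ctx, z_tgt^(t), c, t) - sg(z_tgt) ||_2^2."""
    z_noisy = add_noise(z_tgt, t, eps)
    pred = predictor(z_ctx, z_noisy, c, t)
    # z_tgt plays the role of sg(z_tgt): it is treated as a constant target.
    per_token = np.sum((pred - z_tgt) ** 2, axis=-1)  # squared L2 norm per target token
    return float(per_token.mean())                    # average over target tokens

def ema_update(theta_bar, theta, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, applied per parameter."""
    return {k: tau * theta_bar[k] + (1.0 - tau) * theta[k] for k in theta}

def toy_predictor(z_ctx, z_noisy, c, t):
    """Hypothetical stand-in for g_phi: returns its noisy input unchanged."""
    return z_noisy

rng = np.random.default_rng(0)
z_ctx = rng.normal(size=(6, 8))   # context embeddings
z_tgt = rng.normal(size=(4, 8))   # EMA-encoded target embeddings
c = rng.normal(size=(5, 8))       # fused caption representation
eps = rng.normal(size=z_tgt.shape)

loss_clean = denoising_latent_loss(toy_predictor, z_ctx, z_tgt, c, 0.0, eps)
loss_noisy = denoising_latent_loss(toy_predictor, z_ctx, z_tgt, c, 0.5, eps)
theta_bar = ema_update({"w": np.zeros(3)}, {"w": np.ones(3)}, tau=0.9)
print(loss_clean, loss_noisy > 0.0, theta_bar["w"])
```

The identity predictor has zero loss only when the input is uncorrupted, so any reduction of the loss at $t > 0$ must come from actually using $z_{\text{ctx}}$ and $c$.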

Key results & what's novel

JEPA-T extends the generative-JEPA line (e.g., D-JEPA) toward multimodal, prompt-driven generation. Its novelty is conditioning the JEPA predictor on fused text features, which makes joint-embedding prediction controllable and enables text-to-image synthesis within the JEPA paradigm rather than in pixel or diffusion-token space alone. It shows that latent-prediction generators can incorporate language guidance through fusion, positioning text-conditioned joint-embedding prediction as a viable route to controllable image generation that retains JEPA's abstract, reconstruction-free representation learning.

Strengths & limitations

  • + Enables text-to-image generation within the JEPA framework.
  • + Text conditioning built into the latent objective via fusion, not bolted on downstream.
  • + Retains JEPA's abstract, reconstruction-free representation learning.
  • − Requires paired image-caption data for training.
  • − Inherits the generative-JEPA cost of iterative latent sampling.
  • − Fusion design and the balance of conditioning strength need tuning.

Connections & references

Builds on: D-JEPA