At a glance
Problem: Vision-language alignment via contrastive training needs large batches of paired negatives and learns only global instance-level alignment, missing fine-grained predictive structure.
Key idea: Cast text-image alignment as energy-based joint-embedding prediction: text must predict image representations (and vice versa) in a shared latent space, with no negatives.
Modality: Text + image
Target / masking: One modality is context; the other modality's EMA-encoded embedding is the prediction target.
Builds on: I-JEPA's predictive, energy-based formulation extended to two modalities.
Used for: Multimodal pretraining and downstream text-image tasks.

Motivation

Multimodal models must align the semantics of text and images. The dominant approach, CLIP-style contrastive learning, pulls matched text-image pairs together and pushes mismatched pairs apart, which requires large batches of negatives and learns alignment only at the global instance level. This misses the finer predictive structure between modalities and ties success to negative-sampling scale. TI-JEPA reframes alignment as a prediction problem in an energy-based view: instead of contrasting against negatives, it asks whether one modality's representation can predict the other's, treating compatible pairs as low-energy configurations.

How it works

[Figure: Text + image input (tokens, cross-modal) → context encoder f_θ → z_ctx → predictor g_φ → ẑ; target encoder f̄_θ (EMA copy) → z̄ (sg); latent loss ‖ẑ − sg(z̄)‖².]
Canonical JEPA schematic for Text + image. The input is split into a visible context and hidden targets (token-level, cross-modal). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance.

TI-JEPA uses modality-specific encoders within a joint-embedding predictive architecture.

  • One modality (say text) is embedded by a context encoder into a context representation.
  • The other modality (image) is embedded by an EMA target encoder, producing stop-gradient target embeddings.
  • A predictor maps the context representation into the target's latent space.

Prediction is framed through an energy function that is low when cross-modal embeddings are mutually predictable, so compatible text-image pairs correspond to low-energy states. The stop-gradient EMA target helps prevent representational collapse, and the roles of context and target can be assigned in either direction, so that text predicts image and image predicts text.
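The data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the random linear maps stand in for real text and image encoders, the feature sizes are arbitrary, and using one predictor per direction and summing the two directional energies are choices made here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

D_TEXT, D_IMG, D_LATENT = 16, 32, 8  # hypothetical feature sizes


def make_linear(d_in, d_out):
    """Random linear map standing in for a real encoder or predictor."""
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))


# Online (context) encoders, one per modality.
enc = {"text": make_linear(D_TEXT, D_LATENT), "image": make_linear(D_IMG, D_LATENT)}
# EMA target encoders start as copies of the online encoders.
enc_ema = {name: w.copy() for name, w in enc.items()}
# Assumption: one predictor per direction, keyed by the context modality.
pred = {"text": make_linear(D_LATENT, D_LATENT), "image": make_linear(D_LATENT, D_LATENT)}


def energy(batch, ctx, tgt):
    """Per-pair squared latent distance when modality `ctx` predicts `tgt`."""
    z_ctx = batch[ctx] @ enc[ctx]      # context representation
    z_hat = z_ctx @ pred[ctx]          # predicted target embedding
    z_tgt = batch[tgt] @ enc_ema[tgt]  # target embedding, treated as a constant
    return np.sum((z_hat - z_tgt) ** 2, axis=-1)


# A toy batch of 4 paired text/image feature vectors.
batch = {"text": rng.normal(size=(4, D_TEXT)), "image": rng.normal(size=(4, D_IMG))}

e_t2i = energy(batch, "text", "image")  # text predicts image
e_i2t = energy(batch, "image", "text")  # image predicts text
loss = float(np.mean(e_t2i + e_i2t))    # assumption: directions are simply summed
```

Only matched pairs enter the computation; no mismatched pair is ever scored, which is what distinguishes this from a contrastive objective.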

The objective

Alignment is learned by minimizing the prediction energy, i.e. the squared distance between the predictor's output for the context representation and the detached target embedding:

$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$

where $z_{\text{ctx}}$ comes from one modality, $z_{\text{tgt}}$ comes from the other via the EMA encoder, whose parameters are updated as $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$ with momentum $\tau \in [0, 1)$, and $\operatorname{sg}$ is stop-gradient. Low energy corresponds to mutually predictable, hence aligned, cross-modal representations. No contrastive negatives appear in the loss.
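As a small worked example, the loss and the EMA update can be evaluated by hand and checked in NumPy. The function names and the toy values are hypothetical, and averaging over the batch is an assumption. NumPy has no automatic differentiation, so the stop-gradient is implicit here; in an autodiff framework the target would be explicitly detached.

```python
import numpy as np


def prediction_loss(z_hat, z_tgt):
    """Squared L2 distance between prediction and target, averaged over the batch."""
    return float(np.mean(np.sum((z_hat - z_tgt) ** 2, axis=-1)))


def ema_update(theta_bar, theta, tau):
    """Return tau * theta_bar + (1 - tau) * theta."""
    return tau * theta_bar + (1.0 - tau) * theta


# Toy predicted and target embeddings for a batch of two pairs.
z_hat = np.array([[1.0, 2.0], [0.0, 0.0]])
z_tgt = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = prediction_loss(z_hat, z_tgt)  # (4 + 1) / 2 = 2.5

# Toy parameter vectors for the target (EMA) and context encoders.
theta_bar = np.array([1.0, 1.0])
theta = np.array([0.0, 2.0])
theta_bar = ema_update(theta_bar, theta, tau=0.9)  # [0.9, 1.1]
```

With $\tau$ close to 1 the target encoder changes slowly, so the prediction targets move only gradually while the context encoder and predictor are trained.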

Key results & what's novel

TI-JEPA brings JEPA's predictive, energy-based formulation to vision-language pretraining, offering an alternative to CLIP-style contrastive learning that does not depend on explicit negative sampling. Its novelty is treating text-image alignment as cross-modal latent prediction: semantic correspondence emerges from requiring mutual predictability rather than from contrasting many pairs. The learned cross-modal embeddings transfer to downstream multimodal tasks, demonstrating that joint-embedding prediction generalizes from masked-image self-supervision to genuine multimodal alignment.

Strengths & limitations

  • + No large-batch negative sampling; alignment from mutual predictability.
  • + Energy-based view connects cleanly to the JEPA family.
  • + Transferable cross-modal embeddings for downstream tasks.
  • - Regressing to an expected target can blur fine cross-modal detail when several targets are plausible.
  • - Requires paired text-image data to define the prediction targets.
  • - Anti-collapse depends on the EMA/stop-gradient design being well tuned.
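The blurring limitation listed above follows from a general property of squared-error regression: when several targets are plausible for one context, the loss-minimizing single prediction is their mean. The toy sketch below illustrates this with two hypothetical target embeddings; the values are made up for illustration.

```python
import numpy as np

# Two equally plausible target embeddings for the same context (toy values).
targets = np.array([[1.0, 0.0], [-1.0, 0.0]])


def expected_loss(prediction):
    """Mean squared L2 distance from one prediction to all plausible targets."""
    return float(np.mean(np.sum((prediction - targets) ** 2, axis=-1)))


mean_pred = targets.mean(axis=0)          # [0, 0], lies between the two targets
loss_at_mean = expected_loss(mean_pred)   # (1 + 1) / 2 = 1.0
loss_at_mode = expected_loss(targets[0])  # (0 + 4) / 2 = 2.0
```

The averaged prediction scores better than committing to either plausible target, even though it matches neither of them.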

Connections & references

Builds on: I-JEPA