Motivation
Multimodal models must align the semantics of text and images. The dominant approach, CLIP-style contrastive learning, pulls matched text-image pairs together and pushes mismatched pairs apart, which requires large batches of negatives and learns alignment only at the global instance level. This misses the finer predictive structure between modalities and ties success to negative-sampling scale. TI-JEPA reframes alignment as a prediction problem in an energy-based view: instead of contrasting against negatives, it asks whether one modality's representation can predict the other's, treating compatible pairs as low-energy configurations.
How it works
TI-JEPA uses modality-specific encoders within a joint-embedding predictive architecture.
- One modality (say text) is embedded by a context encoder into a context representation.
- The other modality (image) is embedded by an EMA target encoder, producing stop-gradient target embeddings.
- A predictor maps the context representation into the target's latent space.
Prediction is framed through an energy function that is low when cross-modal embeddings are mutually predictable, so compatible text-image pairs correspond to low-energy states. The stop-gradient EMA target guards against representational collapse, and the roles of context and target can be assigned in either direction, so that text predicts image and image predicts text.
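The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the encoders and predictors are stand-in linear maps, the dimensions are arbitrary, and the names `encoders`, `ema_encoders`, `predictors`, and `forward` are hypothetical. It also assumes one predictor per direction and an EMA encoder initialized as a copy of the same modality's online encoder, neither of which the summary specifies.

```python
import random

rng = random.Random(0)
TEXT_DIM, IMAGE_DIM, LATENT_DIM = 6, 8, 4  # hypothetical sizes


def random_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Apply a dense linear map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# Online (trainable) encoders, one per modality. Assumption: plain linear maps.
encoders = {
    "text": random_matrix(LATENT_DIM, TEXT_DIM),
    "image": random_matrix(LATENT_DIM, IMAGE_DIM),
}
# EMA target encoders, here initialized as copies of the online encoders (assumption).
ema_encoders = {name: [row[:] for row in w] for name, w in encoders.items()}
# One predictor per prediction direction (assumption).
predictors = {
    ("text", "image"): random_matrix(LATENT_DIM, LATENT_DIM),
    ("image", "text"): random_matrix(LATENT_DIM, LATENT_DIM),
}


def forward(pair, ctx, tgt):
    """Return (predicted target embedding, target embedding) for one direction."""
    z_ctx = linear(encoders[ctx], pair[ctx])        # context representation
    z_tgt = linear(ema_encoders[tgt], pair[tgt])    # target, treated as a constant
    z_pred = linear(predictors[(ctx, tgt)], z_ctx)  # mapped into the target's latent space
    return z_pred, z_tgt


pair = {
    "text": [rng.random() for _ in range(TEXT_DIM)],
    "image": [rng.random() for _ in range(IMAGE_DIM)],
}
pred_img, tgt_img = forward(pair, "text", "image")  # text predicts image
pred_txt, tgt_txt = forward(pair, "image", "text")  # image predicts text
```

Swapping the `ctx` and `tgt` arguments is all it takes to reverse the direction, which mirrors the symmetric assignment of roles described above.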
The objective
Alignment is learned by minimizing the prediction energy, the squared distance between the predictor's output and the detached target embedding:
$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$
where $g_\phi$ is the predictor, $z_{\text{ctx}}$ comes from one modality, and $z_{\text{tgt}}$ comes from the other via the EMA encoder, whose parameters $\bar\theta$ track the online encoder parameters $\theta$ as $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$ with momentum $\tau$; $\operatorname{sg}$ is stop-gradient. Low energy corresponds to mutually predictable, hence aligned, cross-modal representations. No contrastive negatives appear in the loss.
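As a worked check of the objective and the EMA rule above, here is a minimal sketch in plain Python. The function names `prediction_energy` and `ema_update` are hypothetical, and the numbers are chosen only so the arithmetic is easy to follow. Plain Python has no automatic differentiation, so the stop-gradient appears only as a comment; in an autodiff framework the target would be explicitly detached.

```python
def prediction_energy(predicted, target):
    """Squared L2 distance between the predictor output and the target embedding.

    The target is assumed to be detached (stop-gradient): it is a constant
    with respect to the parameters being optimized.
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def ema_update(target_params, online_params, tau):
    """Elementwise theta_bar <- tau * theta_bar + (1 - tau) * theta."""
    return [tau * tb + (1.0 - tau) * t for tb, t in zip(target_params, online_params)]


# Energy: (1 - 0)^2 + (2 - 0)^2 = 5
loss = prediction_energy([1.0, 2.0], [0.0, 0.0])

# EMA step with tau = 0.5, picked for readable arithmetic only:
# [0.5 * 0 + 0.5 * 1, 0.5 * 1 + 0.5 * 1] = [0.5, 1.0]
new_target = ema_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
```

EMA schemes of this kind usually set $\tau$ close to 1 so that the target encoder changes slowly; the value the paper uses is not stated in this summary.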
Key results & what's novel
TI-JEPA brings JEPA's predictive, energy-based formulation to vision-language pretraining, offering an alternative to CLIP-style contrastive learning that does not depend on explicit negative sampling. Its novelty is treating text-image alignment as cross-modal latent prediction: semantic correspondence emerges from requiring mutual predictability rather than from contrasting many pairs. The learned cross-modal embeddings transfer to downstream multimodal tasks, indicating that joint-embedding prediction generalizes from masked-image self-supervision to genuine multimodal alignment.
Strengths & limitations
- + No large-batch negative sampling; alignment from mutual predictability.
- + Energy-based view connects cleanly to the JEPA family.
- + Transferable cross-modal embeddings for downstream tasks.
- − Regressing to an expected target can blur fine cross-modal detail when several distinct targets are plausible for the same context.
- − Requires paired text-image data to define the prediction targets.
- − Anti-collapse depends on the EMA/stop-gradient design being well tuned.
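The first limitation listed above follows from a general property of squared-error losses, which the toy example below illustrates with scalars. It is not taken from the paper: the two targets and the candidate grid are made up, and `mse` is a hypothetical helper. When one context has two equally plausible targets, the prediction with the lowest average squared error is their mean, which matches neither target.

```python
def mse(prediction, targets):
    """Average squared error of one prediction against several plausible targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Two equally likely targets for the same context (made-up scalars).
targets = [-1.0, 1.0]
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]

# The best candidate is the mean of the targets, 0.0, not either target itself.
best = min(candidates, key=lambda c: mse(c, targets))
```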