At a glance
Problem: Remote-sensing archives mix sensors (e.g., optical and SAR), but per-modality embeddings prevent a query in one sensor from retrieving relevant imagery in another.
Key idea: Pose cross-modal correspondence as predicting one modality's embedding from another's, building a shared retrieval-ready space without pixel-level translation.
Modality: Remote sensing, cross-modal (e.g., optical ↔ SAR)
Target / masking: A masked target drawn from one modality, predicted from context in a possibly different modality (EMA targets).
Builds on: REJEPA's retrieval-oriented JEPA, generalized across modalities.
Used for: Cross-modal remote-sensing image retrieval.

Motivation

Earth-observation archives contain co-located scenes captured by fundamentally different sensors — optical and synthetic-aperture radar being the canonical pair. Yet retrieval systems typically embed each modality with its own encoder into its own space, so a query expressed in one sensor cannot reliably surface relevant imagery from another. What is needed is a single retrieval-ready space in which co-located scenes from heterogeneous sensors map to nearby points, enabling cross-sensor search. X-JEPA extends the REJEPA retrieval line to address exactly this cross-modal correspondence problem.

How it works

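[Figure: JEPA schematic for remote-sensing (cross-modal) inputs. Patch-level, multi-block masking; the context encoder f_θ produces z_ctx; the predictor g_φ outputs ẑ; the target encoder f̄_θ (EMA copy) produces z̄ under stop-gradient; latent loss ‖ẑ − sg(z̄)‖².]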
Canonical JEPA schematic for the remote-sensing (cross-modal) setting. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimizes the latent distance between prediction and target.

X-JEPA generalizes joint-embedding prediction across modalities.

  • One modality serves as context and is embedded by a context encoder into context tokens.
  • A predictor regresses the latent representation of a masked target drawn from a (possibly different) modality.
  • An EMA target encoder embeds the target modality and supplies the stop-gradient targets.

Because the predictor must map from one modality's embedding to another's, minimizing the latent loss aligns the two modality spaces directly — without any pixel-level cross-modal synthesis or translation. The result is a modality-agnostic space where, for example, an optical query and the corresponding SAR scene land close together, supporting cross-sensor nearest-neighbor retrieval.
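To make the retrieval step concrete, here is a minimal sketch of cross-sensor nearest-neighbor search in a shared space. It assumes embeddings have already been computed; the function names (`cosine_similarity`, `retrieve`), the scene ids, the three-dimensional vectors, and the choice of cosine similarity as the ranking metric are all illustrative assumptions, not details taken from X-JEPA.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_embedding, gallery, k=2):
    """Return the ids of the k gallery items most similar to the query.

    `gallery` maps a scene id to its embedding. Ids and vectors are
    made-up toy values, not outputs of a trained model.
    """
    ranked = sorted(
        gallery,
        key=lambda scene_id: cosine_similarity(query_embedding, gallery[scene_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings: an optical query and a SAR gallery that are
# assumed to live in the same shared space.
optical_query = [0.9, 0.1, 0.0]
sar_gallery = {
    "sar_scene_A": [0.8, 0.2, 0.1],  # stands in for the co-located SAR scene
    "sar_scene_B": [0.0, 1.0, 0.0],
    "sar_scene_C": [0.0, 0.1, 0.9],
}

top_matches = retrieve(optical_query, sar_gallery, k=2)
print(top_matches)
```

Because both modalities share one space, the query needs no modality-specific handling: the same ranking function would serve a SAR query against an optical gallery.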

The objective

For a context view in modality $a$ and a masked target region in modality $b$, the loss is:

$$\mathcal{L} = \big\lVert\, g_\phi\big(f_\theta(x^a), m\big) - \operatorname{sg}\big[\bar f_\theta(x^b)\big]\,\big\rVert_2^2$$

where $\operatorname{sg}$ is stop-gradient, $m$ indicates the masked target region, and the target encoder's parameters $\bar\theta$ are updated by EMA with momentum $\tau$: $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. Predicting across modalities forces the shared latent space to encode content that is common to both sensors, which is the alignment signal retrieval exploits. This is achieved without contrastive pairs or generative translation.
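The two formulas above can be sketched numerically as follows. This is a toy illustration in plain Python, not the authors' implementation: `latent_loss` and `ema_update` are hypothetical helper names, the vectors stand in for the predictor output and the target embedding, and the weights are made-up numbers.

```python
def latent_loss(predicted, target):
    """Squared L2 distance between predicted and target embeddings.

    `target` plays the role of sg[f_bar(x^b)]: it is treated as a constant,
    so in an autodiff framework no gradient would flow back through it.
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def ema_update(target_params, context_params, tau):
    """Element-wise EMA: tau * target + (1 - tau) * context."""
    return [tau * t + (1.0 - tau) * c for t, c in zip(target_params, context_params)]


# Toy stand-ins for g_phi(f_theta(x^a), m) and f_bar(x^b).
predicted = [1.0, 2.0, 3.0]
target = [1.0, 0.0, 1.0]
loss = latent_loss(predicted, target)  # 0 + 4 + 4

# Hypothetical weights for the context and target encoders.
theta = [1.0, 0.0]
theta_bar = [0.0, 2.0]
theta_bar = ema_update(theta_bar, theta, tau=0.5)

print(loss, theta_bar)
```

Here `tau=0.5` is chosen only to keep the arithmetic readable; EMA momentum in JEPA-style training is usually set close to 1, so the target encoder changes slowly.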

Key results & what's novel

Presented at WACV 2026, X-JEPA extends the REJEPA descriptor-learning line from intra-image masking to genuine cross-modal correspondence. Its novelty is showing that JEPA's predictor-plus-latent-loss design scales beyond a single image to bridge fundamentally different imaging modalities, yielding compact descriptors that enable queries across sensor types. This is a key capability for unified EO search, replacing both contrastive cross-modal pairing and generative cross-modal synthesis with a single predictive alignment objective.

Strengths & limitations

  • + Enables cross-sensor retrieval through a single shared embedding space.
  • + No pixel-level translation or contrastive negatives required.
  • + Inherits REJEPA's compact, efficient descriptors.
  • − Requires co-located multi-modal pairs for training.
  • − Alignment quality depends on how much content the modalities genuinely share.
  • − Focused on retrieval; not a generative or action-conditioned model.

Connections & references

Builds on: REJEPA