Motivation
Earth-observation archives contain co-located scenes captured by fundamentally different sensors — optical and synthetic-aperture radar being the canonical pair. Yet retrieval systems typically embed each modality with its own encoder into its own space, so a query expressed in one sensor cannot reliably surface relevant imagery from another. What is needed is a single retrieval-ready space in which co-located scenes from heterogeneous sensors map to nearby points, enabling cross-sensor search. X-JEPA extends the REJEPA retrieval line to address exactly this cross-modal correspondence problem.
How it works
X-JEPA generalizes joint-embedding prediction across modalities.
- One modality serves as context and is embedded by a context encoder into context tokens.
- A predictor regresses the latent representation of a masked target drawn from a (possibly different) modality.
- An EMA target encoder embeds the target modality and supplies the stop-gradient targets.
Because the predictor must map from one modality's embedding to another's, minimizing the latent loss aligns the two modality spaces directly — without any pixel-level cross-modal synthesis or translation. The result is a modality-agnostic space where, for example, an optical query and the corresponding SAR scene land close together, supporting cross-sensor nearest-neighbor retrieval.
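To make the retrieval step concrete, here is a minimal sketch of cross-sensor nearest-neighbor search in a shared embedding space. The text above does not specify a similarity metric, so cosine similarity is an assumption here. The descriptor values, the scene ids, and the helper names `cosine_similarity` and `retrieve` are hypothetical illustrations, not part of X-JEPA.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, gallery, k=1):
    """Return the ids of the k gallery descriptors most similar to the query."""
    ranked = sorted(
        gallery,
        key=lambda scene_id: cosine_similarity(query, gallery[scene_id]),
        reverse=True,
    )
    return ranked[:k]

# Made-up toy descriptors: one optical query and a small gallery of SAR scenes.
# In practice both would come from encoders trained into the same space.
optical_query = [0.9, 0.1, 0.2]
sar_gallery = {
    "sar_scene_a": [0.1, 0.9, 0.3],
    "sar_scene_b": [0.8, 0.2, 0.1],
    "sar_scene_c": [0.0, 0.2, 0.9],
}

top = retrieve(optical_query, sar_gallery, k=2)
print(top)
```

Because both modalities share one space, the query and the gallery need no modality-specific handling at search time; only the encoders that produced the descriptors differ.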
The objective
For a context view in modality $a$ and a masked target region in modality $b$, the loss is:
$$\mathcal{L} = \big\lVert\, g_\phi\big(f_\theta(x^a), m\big) - \operatorname{sg}\big[f_{\bar\theta}(x^b)\big]\,\big\rVert_2^2$$
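where $f_\theta$ is the context encoder, $g_\phi$ is the predictor, $m$ specifies the masked target region to predict, and $\operatorname{sg}$ is stop-gradient. The target encoder's weights $\bar\theta$ receive no gradient; they are updated by EMA, $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, with momentum $\tau \in [0, 1)$. Predicting across modalities forces the shared latent space to encode content that is common to both sensors, which is the alignment signal retrieval exploits, and it does so without contrastive pairs or generative translation.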
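The two pieces of the objective, the squared L2 latent loss and the EMA update, can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the vectors stand in for the predictor output and the target encoder output, the parameter lists stand in for full network weights, and the names `squared_l2` and `ema_update` and the value `tau=0.99` are assumptions.

```python
def squared_l2(pred, target):
    """Squared L2 distance between a predicted and a target latent."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def ema_update(target_params, online_params, tau):
    """EMA rule: theta_bar <- tau * theta_bar + (1 - tau) * theta."""
    return [
        tau * t_bar + (1.0 - tau) * t
        for t_bar, t in zip(target_params, online_params)
    ]

# Toy stand-in for the predictor output given the modality-a context.
predicted_latent = [0.5, -1.0, 2.0]
# Toy stand-in for the target encoder output on modality b. It is treated as
# a constant; in an autodiff framework this is where the stop-gradient
# (detaching the tensor) would be applied.
target_latent = [1.0, -1.0, 1.0]

loss = squared_l2(predicted_latent, target_latent)
print(loss)

# Hypothetical two-parameter "encoders": the target weights drift slowly
# toward the online (context) weights after each optimizer step.
online_params = [1.0, 2.0]
target_params = [0.0, 0.0]
target_params = ema_update(target_params, online_params, tau=0.99)
print(target_params)
```

With `tau` close to 1 the target encoder changes slowly, so the regression targets stay stable from one step to the next while still following the context encoder over time.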
Key results & what's novel
Presented at WACV 2026, X-JEPA extends the REJEPA descriptor-learning line from intra-image masking to genuine cross-modal correspondence. Its novelty is showing that JEPA's predictor-plus-latent-loss design scales beyond a single image to bridge fundamentally different imaging modalities, yielding compact descriptors that enable queries across sensor types. Cross-sensor querying is a key capability for unified EO search, and X-JEPA provides it with a single predictive alignment objective in place of both contrastive cross-modal pairing and generative cross-modal synthesis.
Strengths & limitations
- + Enables cross-sensor retrieval through a single shared embedding space.
- + No pixel-level translation or contrastive negatives required.
- + Inherits REJEPA's compact, efficient descriptors.
- − Requires co-located multi-modal pairs for training.
- − Alignment quality depends on how much content the modalities genuinely share.
- − Focused on retrieval; not a generative or action-conditioned model.