Motivation
Remote-sensing archives are enormous, and remote-sensing image retrieval (RSIR), i.e. content-based retrieval over such archives, must return semantically similar scenes quickly. The dominant pretraining recipes are a poor fit: contrastive learning needs large batches of paired negatives and careful collapse management, while reconstruction-based methods spend capacity on pixel detail. Neither directly optimizes the embedding geometry that nearest-neighbor retrieval depends on. REJEPA targets descriptor learning specifically: embeddings that are cheap to compute, compact, and arranged so that semantic similarity corresponds to embedding proximity, all without the overhead of negatives or a pixel decoder.
How it works
REJEPA applies the joint-embedding predictive architecture (JEPA) to retrieval over Earth-observation (EO) image patches, using three components:
- A context encoder embeds a masked view of the image into context tokens.
- A lightweight predictor estimates the latent features of masked target regions.
- An EMA target encoder embeds the full image and supplies the (stop-gradient) target representations at the masked positions.
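The three roles above imply a split of patch positions into a visible set, seen by the context encoder, and a masked set, where the predictor is scored against the target encoder. A minimal sketch of that split follows. Uniform random masking is an assumption made only for illustration, since the text does not specify the masking scheme, and `split_patches`, `mask_ratio`, and `seed` are hypothetical names.

```python
import random

def split_patches(num_patches, mask_ratio=0.5, seed=0):
    """Split patch indices into visible (context) and masked (target) sets.

    Assumption: masked positions are drawn uniformly at random.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    # Keep at least one masked and one visible patch.
    num_masked = int(round(num_patches * mask_ratio))
    num_masked = min(num_patches - 1, max(1, num_masked))
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

# A hypothetical 4 x 4 patch grid, flattened to 16 positions.
visible, masked = split_patches(num_patches=16, mask_ratio=0.5)

# Data flow implied by the component list:
#   context encoder -> sees only the patches at `visible`
#   target encoder  -> sees all 16 patches; its outputs are read at `masked`
#   predictor       -> takes the context tokens plus `masked` and
#                      estimates the target embeddings at those positions
```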
Training operates entirely in embedding space, so there is no pixel reconstruction and no negative sampling. After pretraining, the context encoder produces a single compact descriptor per image that is used directly for similarity search, making indexing and query-time computation efficient.
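To illustrate how one descriptor per image supports similarity search, here is a small sketch of exact nearest-neighbor retrieval. It assumes cosine similarity as the proximity measure and brute-force scoring. The text fixes neither choice, and a large archive would more plausibly use an array library and an approximate index. The function names and the toy 3-D descriptors are made up for the example.

```python
import math

def l2_normalize(vec):
    """Scale a descriptor to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def build_index(descriptors):
    """Normalize every archive descriptor once, at indexing time."""
    return [l2_normalize(d) for d in descriptors]

def search(index, query, k=3):
    """Return (archive_position, cosine_similarity) for the k most similar descriptors."""
    q = l2_normalize(query)
    scores = [(i, sum(a * b for a, b in zip(d, q))) for i, d in enumerate(index)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Toy descriptors standing in for context-encoder outputs (made-up values).
archive = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
index = build_index(archive)
top = search(index, [1.0, 0.05, 0.0], k=2)
```

Normalizing at indexing time keeps query-time work to one dot product per archive entry, which is the kind of cost profile a compact descriptor is meant to allow.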
The objective
The loss is the latent-space distance between predicted and target embeddings over masked regions:
$$\mathcal{L} = \big\lVert\, \hat z - \operatorname{sg}(z)\,\big\rVert_2^2, \qquad \hat z = g_\phi(z_{\text{ctx}}, m),\;\; z = f_{\bar\theta}(x)$$
where $\operatorname{sg}$ is stop-gradient, $z_{\text{ctx}}$ denotes the context tokens produced by the context encoder $f_\theta$ from the masked view, $m$ specifies the masked positions, $g_\phi$ is the predictor, and the target encoder $f_{\bar\theta}$ follows the EMA update $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. Because the objective acts directly on the geometry of the embedding, pulling predicted embeddings toward their targets in latent space, the resulting descriptors are organized by semantic content, which is what retrieval requires.
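The loss and the EMA rule can be written out on plain lists as a rough sketch. There is no autodiff here, so stop-gradient is only mimicked by treating the targets as constants. The flat parameter lists, the toy embeddings, and the value `tau=0.9` are illustrative assumptions and are not taken from REJEPA.

```python
def latent_loss(predicted, target):
    """Squared L2 distance between predicted and target embeddings,
    summed over all masked positions and embedding dimensions."""
    return sum(
        (p - t) ** 2
        for pred_vec, tgt_vec in zip(predicted, target)
        for p, t in zip(pred_vec, tgt_vec)
    )

def ema_update(target_params, online_params, tau):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, element by element."""
    return [
        tau * t_bar + (1.0 - tau) * t
        for t_bar, t in zip(target_params, online_params)
    ]

# Two masked positions with 2-D embeddings (made-up numbers).
z_hat = [[1.0, 2.0], [0.0, 0.0]]  # predictor outputs
z = [[1.0, 0.0], [0.0, 1.0]]      # target-encoder outputs, held constant
loss = latent_loss(z_hat, z)

# Toy parameter vectors for the context (online) and target encoders.
theta = [0.0, 1.0]
theta_bar = [1.0, 0.0]
theta_bar = ema_update(theta_bar, theta, tau=0.9)
```

In a real training loop, only the context encoder and predictor would receive gradients from the loss, and the target encoder would change solely through the EMA step.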
Key results & what's novel
REJEPA shows that JEPA pretraining transfers naturally to an information-retrieval objective in the EO domain. Predicting masked content in embedding space induces a representation whose geometry reflects semantic content, yielding label-free descriptors that are efficient to compute and well-suited to nearest-neighbor search over large archives. By avoiding both contrastive negatives and generative decoding, it offers a practical alternative to the two prevailing pretraining families for RSIR, and it sets up the cross-modal extension pursued in X-JEPA.
Strengths & limitations
- + Compact, retrieval-ready descriptors learned without labels or negatives.
- + Efficient at indexing and query time; no pixel decoder.
- + Embedding geometry aligns with semantic similarity by construction.
- − Operates within a single modality; cross-sensor queries need X-JEPA.
- − Retrieval quality still depends on masking design and encoder capacity.
- − A pure representation learner; no generative or world-model capability.