At a glance
Problem: Content-based remote-sensing image retrieval must index huge archives quickly, but contrastive and reconstruction pretraining yield embeddings that are costly to learn and are not optimized for retrieval similarity.
Key idea: Use masked latent prediction to learn compact, semantically organized embeddings that serve directly as retrieval descriptors, with no pixel decoding or negative sampling.
Modality: Remote-sensing imagery
Target / masking: Masked target regions whose EMA-encoded latent features are predicted from a masked context view.
Builds on: I-JEPA's joint-embedding predictive recipe.
Used for: Efficient large-archive remote-sensing image retrieval (RSIR).

Motivation

Remote-sensing archives are enormous, and content-based remote-sensing image retrieval (RSIR) must return semantically similar scenes from such archives quickly. The dominant pretraining recipes are a poor fit: contrastive learning needs large batches of negatives and careful collapse management, while reconstruction-based methods spend capacity on pixel detail. Neither directly optimizes the embedding geometry that nearest-neighbor retrieval depends on. REJEPA targets descriptor learning specifically: embeddings that are cheap to compute, compact, and arranged so that semantic similarity corresponds to embedding proximity, all without the overhead of negatives or a pixel decoder.

How it works

[Figure: JEPA schematic. A remote-sensing image is split into patches with multi-block masking. The context encoder $f_\theta$ outputs $z_{\text{ctx}}$; the predictor $g_\phi$ outputs $\hat z$; the target encoder $\bar f_\theta$ (EMA copy) outputs $\bar z$ under stop-gradient; the latent loss is $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.]
Canonical JEPA schematic for remote-sensing imagery. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance.
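The patch-level, multi-block split described in the caption can be illustrated with a small sampler. This is a minimal sketch, not REJEPA's actual masking code: the function names `sample_block` and `multi_block_masks`, the 14×14 grid, the number of target blocks, and the scale and aspect-ratio ranges are all assumptions chosen in the spirit of the I-JEPA recipe; the source does not state REJEPA's masking hyperparameters.

```python
import numpy as np

def sample_block(grid, scale, aspect, rng):
    """Sample one rectangular block of patch indices on a grid x grid layout."""
    area = scale * grid * grid
    # Height and width follow from the requested area and aspect ratio (h / w).
    h = min(max(int(round(np.sqrt(area * aspect))), 1), grid)
    w = min(max(int(round(np.sqrt(area / aspect))), 1), grid)
    top = int(rng.integers(0, grid - h + 1))    # upper bound is exclusive
    left = int(rng.integers(0, grid - w + 1))
    rows, cols = np.meshgrid(np.arange(top, top + h),
                             np.arange(left, left + w), indexing="ij")
    return set((rows * grid + cols).ravel().tolist())

def multi_block_masks(grid=14, n_targets=4, rng=None):
    """Return (context indices, list of target-block indices). Values are illustrative."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Several small target blocks (assumed ranges, not taken from the source).
    targets = [sample_block(grid, rng.uniform(0.15, 0.2), rng.uniform(0.75, 1.5), rng)
               for _ in range(n_targets)]
    # One large context block, from which every target patch is removed.
    context = sample_block(grid, rng.uniform(0.85, 1.0), 1.0, rng)
    context -= set().union(*targets)
    return sorted(context), [sorted(t) for t in targets]

context, targets = multi_block_masks(grid=14, n_targets=4)
```

Removing the target patches from the context block is what prevents the predictor from simply copying visible features: everything it must predict is hidden from the context encoder.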

REJEPA instantiates the joint-embedding predictive architecture (JEPA) for retrieval over Earth-observation (EO) image patches.

  • A context encoder embeds a masked view of the image into context tokens.
  • A lightweight predictor estimates the latent features of masked target regions.
  • An EMA target encoder embeds the full image and supplies the (stop-gradient) target representations at the masked positions.
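The data flow through the three components listed above can be traced with toy stand-ins. The sketch below is illustrative only: the per-patch `tanh` encoder, the mean-pooled predictor input, the weight names (`W_ctx`, `W_tgt`, `W_pred`), and all sizes are assumptions made to keep the example short. A real implementation would presumably use transformer encoders in which tokens interact, which is where the difference between a masked view and the full image actually matters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, D = 16, 48, 32                    # patches, raw patch dim, embedding dim (toy sizes)

patches = rng.normal(size=(N, P))       # flattened image patches
pos = rng.normal(size=(N, D))           # positional embeddings, one per patch

W_ctx = rng.normal(size=(P, D)) * 0.1   # context encoder weights (trainable)
W_tgt = W_ctx.copy()                    # target encoder starts as a copy, later updated by EMA
W_pred = rng.normal(size=(D, D)) * 0.1  # lightweight predictor weights

def encode(x, W, p):
    """Toy encoder: linear patch embedding plus position, then a nonlinearity."""
    return np.tanh(x @ W + p)

target_idx = np.array([5, 6, 9, 10])                      # masked target positions
context_idx = np.setdiff1d(np.arange(N), target_idx)      # visible context positions

# 1. Context encoder sees only the visible patches.
z_ctx = encode(patches[context_idx], W_ctx, pos[context_idx])
# 2. Target encoder embeds the full image; targets are read off at masked positions.
z_bar = encode(patches, W_tgt, pos)[target_idx]
# 3. Predictor combines a context summary with each target position as a query.
z_hat = (z_ctx.mean(axis=0) + pos[target_idx]) @ W_pred
```

Here `z_hat` and `z_bar` have one row per masked position, so they can be compared directly in embedding space.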

Training operates entirely in embedding space, so there is no pixel reconstruction and no negative sampling. After pretraining, the context encoder produces a single compact descriptor per image that is used directly for similarity search, making indexing and query-time computation efficient.
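To make the retrieval step concrete, here is a minimal sketch of indexing and querying with one descriptor per image. The source does not say how token outputs are reduced to a single descriptor or which similarity is used, so mean pooling, L2 normalization, and cosine similarity are assumptions, as are the helper names `descriptor` and `top_k`. Random arrays stand in for encoder outputs.

```python
import numpy as np

def descriptor(tokens):
    """Mean-pool token embeddings and L2-normalize (assumed pooling scheme)."""
    d = tokens.mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-12)

def top_k(query, index, k=3):
    """Return ids and cosine scores of the k most similar archive items."""
    scores = index @ query               # dot product equals cosine on unit vectors
    order = np.argsort(-scores)[:k]
    return order, scores[order]

rng = np.random.default_rng(0)
# Stand-in for context-encoder outputs: 100 images, 16 tokens each, 32 dims.
archive_tokens = rng.normal(size=(100, 16, 32))
index = np.stack([descriptor(t) for t in archive_tokens])   # built once, offline

# Query: a slightly perturbed copy of archive image 42.
query = descriptor(archive_tokens[42] + 0.01 * rng.normal(size=(16, 32)))
ids, scores = top_k(query, index)
```

Because the index is a plain matrix of unit vectors, query time is a single matrix-vector product; for very large archives the same descriptors could be handed to an approximate nearest-neighbor library instead.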

The objective

The loss is the latent-space distance between predicted and target embeddings over masked regions:

$$\mathcal{L} = \big\lVert\, \hat z - \operatorname{sg}(\bar z)\,\big\rVert_2^2, \qquad \hat z = g_\phi(z_{\text{ctx}}, m),\;\; \bar z = \bar f_\theta(x)$$

where $\operatorname{sg}$ is stop-gradient, $m$ presumably denotes the masked target positions passed to the predictor, and the target encoder follows the EMA update $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, with $\theta$ the context-encoder parameters and $\bar\theta$ the target-encoder parameters. Because the objective shapes the geometry of the embedding directly, pulling predictable regions together in latent space, the resulting descriptors are organized by semantic content, which is precisely what retrieval requires.
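The loss and the EMA update are simple enough to check by hand. The following sketch assumes the squared distance is averaged over target tokens (the formula above does not specify a reduction) and uses hypothetical helper names `latent_loss` and `ema_update`. NumPy has no autograd, so stop-gradient is represented only by treating the target as a constant.

```python
import numpy as np

def latent_loss(z_hat, z_bar):
    """Squared L2 distance per target token, averaged over tokens (assumed reduction).

    z_bar plays the role of sg(target): it is treated as a constant,
    so in a real framework no gradient would flow into the target encoder.
    """
    diff = z_hat - z_bar
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

def ema_update(theta_bar, theta, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, applied per parameter."""
    return {k: tau * theta_bar[k] + (1.0 - tau) * theta[k] for k in theta}

# Worked example: two target tokens, each at squared distance 1 from its target.
z_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
z_bar = np.zeros((2, 2))
loss = latent_loss(z_hat, z_bar)        # (1 + 1) / 2 = 1.0

# EMA with tau = 0.9: 0.9 * 1.0 + 0.1 * 0.0 = 0.9
theta_bar = ema_update({"w": np.array([1.0])}, {"w": np.array([0.0])}, tau=0.9)
```

A value of $\tau$ close to 1 makes the target encoder change slowly, which gives the predictor a stable target to regress toward; the default `tau=0.996` here is an illustrative choice, not a value reported in the source.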

Key results & what's novel

REJEPA shows that JEPA pretraining transfers naturally to an information-retrieval objective in the EO domain. Predicting masked content in embedding space induces a representation whose geometry reflects semantic content, yielding label-free descriptors that are efficient to compute and well-suited to nearest-neighbor search over large archives. By avoiding both contrastive negatives and generative decoding, it offers a practical alternative to the two prevailing pretraining families for RSIR, and it sets up the cross-modal extension pursued in X-JEPA.

Strengths & limitations

  • + Compact, retrieval-ready descriptors learned without labels or negatives.
  • + Efficient at indexing and query time; no pixel decoder.
  • + Embedding geometry aligns with semantic similarity by construction.
  • − Operates within a single modality; cross-sensor queries need X-JEPA.
  • − Retrieval quality still depends on masking design and encoder capacity.
  • − A pure representation learner; no generative or world-model capability.

Connections & references

Builds on: I-JEPA