At a glance
Problem: Earth-observation data is fragmented across modalities, resolutions, and spatial scales, forcing one bespoke model per sensor configuration.
Key idea: A single joint-embedding predictive model with scale-adaptive, modality-agnostic tokenization that absorbs heterogeneous EO inputs and predicts masked targets in latent space.
Modality: Multi-sensor satellite (optical, SAR, multispectral, time series)
Target / masking: Masked patch/time tokens whose EMA-encoded embeddings are predicted; patch size adapts to ground-sampling distance.
Builds on: I-JEPA's masked latent prediction, generalized across sensors and scales.
Used for: Foundation backbone for EO classification, segmentation, and time-series tasks.

Motivation

Remote sensing produces wildly heterogeneous data: optical and SAR imagery, dozens of spectral bands, and dense time series, captured at resolutions spanning sub-meter to decametric and at scales from small patches to large tiles. Most EO models are trained for one fixed configuration — a single sensor, a single resolution — so a new sensor or task means a new network. This fragmentation prevents a reusable foundation backbone for the field. AnySat asks whether a single model can ingest the full diversity of EO inputs jointly, producing features transferable across modalities, resolutions, and scales without per-configuration retraining.

How it works

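[Figure: JEPA schematic. Multi-sensor satellite input (patches, multi-block masking); context encoder $f_\theta$ producing $z_{\text{ctx}}$; target encoder $\bar f_\theta$ (EMA copy) producing $\bar z$ under stop-gradient; predictor $g_\phi$ producing $\hat z$; latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.]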
Canonical JEPA schematic for multi-sensor satellite inputs. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the latent distance between predictions and targets.
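The context/target split in the schematic can be illustrated with a minimal sketch. This is not AnySat's masking code: the function names, block sizes, and the simplification that the context is every token outside the target blocks are all assumptions made for illustration.

```python
import random
from typing import List, Set, Tuple

Token = Tuple[int, int]  # (row, col) index on the token grid


def sample_block(grid_h: int, grid_w: int, block_h: int, block_w: int,
                 rng: random.Random) -> Set[Token]:
    """Pick one rectangular block of tokens at a random position."""
    if block_h > grid_h or block_w > grid_w:
        raise ValueError("block must fit inside the grid")
    top = rng.randrange(grid_h - block_h + 1)
    left = rng.randrange(grid_w - block_w + 1)
    return {(r, c)
            for r in range(top, top + block_h)
            for c in range(left, left + block_w)}


def multi_block_mask(grid_h: int, grid_w: int, num_blocks: int,
                     block_h: int, block_w: int,
                     seed: int = 0) -> Tuple[List[Token], List[Token]]:
    """Return (context, targets); targets are the union of the sampled blocks.

    Assumption: the context is simply every token not covered by a target
    block, which is a simplification of multi-block masking schemes.
    """
    rng = random.Random(seed)
    targets: Set[Token] = set()
    for _ in range(num_blocks):
        targets |= sample_block(grid_h, grid_w, block_h, block_w, rng)
    all_tokens = {(r, c) for r in range(grid_h) for c in range(grid_w)}
    return sorted(all_tokens - targets), sorted(targets)


context, targets = multi_block_mask(6, 6, num_blocks=2, block_h=2, block_w=2)
print(len(context), "context tokens,", len(targets), "target tokens")
```

Target blocks may overlap, so the number of hidden tokens varies between draws; the context and target sets are always disjoint and together cover the grid.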

AnySat builds on the joint-embedding predictive architecture with a flexible front end.

  • Inputs of arbitrary resolution and channel set are projected into a shared token space; spatial scale is handled by adapting patch size to the ground-sampling distance, so one architecture sees diverse sensors consistently.
  • A context encoder processes a masked subset of patch and time tokens.
  • A predictor reconstructs the latent embeddings of masked targets produced by an EMA target encoder.
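One way to picture "adapting patch size to the ground-sampling distance" is to fix the ground footprint of a token and derive the pixel patch size from each sensor's resolution. The sketch below makes that assumption explicit; the 10 m footprint, the sensor names, and the helper functions are hypothetical and not taken from AnySat.

```python
from typing import Tuple


def patch_size_px(gsd_m: float, footprint_m: float) -> int:
    """Pixels per patch side so one token covers about footprint_m metres."""
    if gsd_m <= 0 or footprint_m <= 0:
        raise ValueError("gsd_m and footprint_m must be positive")
    return max(1, round(footprint_m / gsd_m))


def token_grid(height_px: int, width_px: int,
               gsd_m: float, footprint_m: float) -> Tuple[int, int]:
    """Number of non-overlapping patches per axis (partial edge patches dropped)."""
    p = patch_size_px(gsd_m, footprint_m)
    return height_px // p, width_px // p


# Hypothetical sensors imaging the same 60 m x 60 m tile at different resolutions:
# (height in pixels, width in pixels, ground-sampling distance in metres).
sensors = {
    "aerial_0.2m": (300, 300, 0.2),
    "optical_10m": (6, 6, 10.0),
}
for name, (h, w, gsd) in sensors.items():
    print(name, patch_size_px(gsd, 10.0), token_grid(h, w, gsd, 10.0))
```

Under these assumptions the 0.2 m sensor uses 50-pixel patches and the 10 m sensor uses 1-pixel patches, yet both produce the same 6 x 6 token grid, so downstream layers see spatially aligned tokens regardless of resolution.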

Because supervision lives entirely in representation space, the model never decodes heterogeneous raw signals and avoids per-modality reconstruction losses. The same masked latent-prediction pretext therefore applies uniformly whether the input is an optical tile, a SAR scene, or a multispectral time series.

The objective

For the $M$ masked target tokens, indexed by $k$, the loss is the mean squared latent distance between predicted and EMA-target representations:

$$\mathcal{L} = \frac{1}{M}\sum_k \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2$$

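with $\operatorname{sg}$ the stop-gradient, $z_{\text{ctx}}$ the context encoder's output, $m_k$ the mask token marking target position $k$, and the target encoder's parameters $\bar\theta$ updated as an EMA of the context encoder's parameters $\theta$: $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. Because the shared tokenizer maps every input into the same token space, the objective is computed identically across resolutions and modalities, and collapse is countered by the standard predictor-plus-EMA mechanism rather than by any modality-specific loss.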
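The loss and the EMA update can be checked numerically with a small sketch. It uses plain Python lists instead of a tensor library, so the stop-gradient is implicit (targets are just constants); the function names and the toy values are assumptions for illustration only.

```python
from typing import List, Sequence

Vector = Sequence[float]


def latent_loss(predicted: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Mean over the M masked tokens of the squared L2 distance.

    `targets` plays the role of sg[target encoder output]: it is treated as a
    constant, which is what stop-gradient means in an autograd framework.
    """
    if not predicted or len(predicted) != len(targets):
        raise ValueError("need the same non-zero number of predictions and targets")
    total = 0.0
    for z_hat, z_bar in zip(predicted, targets):
        total += sum((a - b) ** 2 for a, b in zip(z_hat, z_bar))
    return total / len(predicted)


def ema_update(target_params: Sequence[float], online_params: Sequence[float],
               tau: float) -> List[float]:
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, element-wise."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


# Toy example with M = 2 masked tokens and 2-dimensional embeddings.
z_hat = [[1.0, 0.0], [0.0, 2.0]]   # predictor outputs
z_bar = [[0.0, 0.0], [0.0, 0.0]]   # EMA-target embeddings
print(latent_loss(z_hat, z_bar))                  # (1 + 4) / 2 = 2.5
print(ema_update([1.0, 0.0], [0.0, 1.0], 0.75))   # [0.75, 0.25]
```

With $\tau$ close to 1 the target encoder changes slowly, which is the usual choice in EMA-based schemes; the value 0.75 here is only chosen to keep the arithmetic easy to follow.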

Key results & what's novel

AnySat's novelty is unifying EO heterogeneity under one joint-embedding model via scale-adaptive, modality-agnostic tokenization. A single network transfers to classification, semantic segmentation, and time-series tasks under varied sensor configurations without training separate backbones. By replacing generative reconstruction with predictive alignment in latent space, it sidesteps the difficulty of decoding noisy, multi-band, multi-resolution signals, which is the setting where a latent objective helps most. This moves the field toward a single remote-sensing foundation model rather than a collection of sensor-specific encoders.

Strengths & limitations

  • + One model spans modalities, resolutions, and scales; broadly reusable backbone.
  • + Latent supervision avoids per-modality reconstruction decoders.
  • + Transfers across diverse downstream EO tasks.
  • - Scale-adaptive tokenization adds design complexity and assumptions about ground-sampling distance.
  • - Joint training across very different sensors can dilute modality-specific detail.
  • - Like all JEPAs, it is a representation learner, not a generative or action-conditioned world model.

Connections & references

Builds on: I-JEPA