Motivation
Remote sensing produces highly heterogeneous Earth observation (EO) data: optical and SAR imagery, dozens of spectral bands, and dense time series, captured at resolutions spanning sub-meter to decametric and at scales from small patches to large tiles. Most EO models are trained for one fixed configuration (a single sensor, a single resolution), so a new sensor or task means a new network. This fragmentation stands in the way of a reusable foundation backbone for the field. AnySat asks whether a single model can ingest the full diversity of EO inputs jointly, producing features that transfer across modalities, resolutions, and scales without per-configuration retraining.
How it works
AnySat builds on the joint-embedding predictive architecture (JEPA) and adds a flexible front end.
- Inputs of arbitrary resolution and channel set are projected into a shared token space; spatial scale is handled by adapting patch size to the ground-sampling distance, so one architecture sees diverse sensors consistently.
- A context encoder processes a masked subset of patch and time tokens.
- A predictor predicts the latent embeddings of the masked targets; those embeddings are produced by a target encoder whose weights are an exponential moving average (EMA) of the context encoder's.
Because supervision lives entirely in representation space, the model never decodes heterogeneous raw signals and avoids per-modality reconstruction losses. The same masked latent-prediction pretext therefore applies uniformly whether the input is an optical tile, a SAR scene, or a multispectral time series.
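The scale-adaptive patching described above can be illustrated with a minimal sketch. It assumes that each token should cover a fixed ground footprint in meters, so the patch side in pixels is the footprint divided by the ground-sampling distance (GSD). The function names `patch_size_pixels` and `patchify`, the 100 m footprint, and the image shapes are hypothetical choices for illustration, not AnySat's actual implementation.

```python
import numpy as np


def patch_size_pixels(ground_patch_m: float, gsd_m: float) -> int:
    """Pixels per patch side so that one token covers about ground_patch_m meters."""
    return max(1, round(ground_patch_m / gsd_m))


def patchify(image: np.ndarray, ground_patch_m: float, gsd_m: float) -> np.ndarray:
    """Split a (C, H, W) image into flattened patches of equal ground footprint.

    Returns an array of shape (num_patches, C * p * p). Rows and columns that
    do not fill a whole patch are cropped (an assumption of this sketch).
    """
    c, h, w = image.shape
    p = patch_size_pixels(ground_patch_m, gsd_m)
    nh, nw = h // p, w // p
    if nh == 0 or nw == 0:
        raise ValueError("image is smaller than one patch at this GSD")
    cropped = image[:, : nh * p, : nw * p]
    # (C, nh, p, nw, p) -> (nh, nw, C, p, p) -> (nh * nw, C * p * p)
    patches = cropped.reshape(c, nh, p, nw, p).transpose(1, 3, 0, 2, 4)
    return patches.reshape(nh * nw, c * p * p)


# Two hypothetical sensors imaging the same 600 m x 600 m area.
coarse = np.zeros((4, 60, 60))    # 4 bands at 10 m GSD
fine = np.zeros((3, 600, 600))    # 3 bands at 1 m GSD

coarse_tokens = patchify(coarse, ground_patch_m=100.0, gsd_m=10.0)
fine_tokens = patchify(fine, ground_patch_m=100.0, gsd_m=1.0)
print(coarse_tokens.shape, fine_tokens.shape)
```

Both inputs yield the same 6 x 6 token grid even though their raw patch dimensions differ (400 versus 30000 values). In this sketch a per-sensor projection would then map each flattened patch to the shared token width.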
The objective
For the $M$ masked target tokens, indexed by $k$, the loss is the squared latent distance between predicted and EMA-target representations:
$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[f_{\bar\theta}(x)_k\big]\,\big\rVert_2^2$$
Here $g_\phi$ is the predictor, $z_{\text{ctx}}$ the context-encoder output, $m_k$ the query token for masked position $k$, $f_{\bar\theta}$ the target encoder, and $\operatorname{sg}$ the stop-gradient. The target weights track the context-encoder weights $\theta$ by EMA, $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. Because the tokenizer is shared, the same objective is computed identically across resolutions and modalities; collapse is countered by the standard predictor-plus-EMA mechanism rather than by any modality-specific loss.
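As a rough numerical illustration of the objective and the EMA update above, the sketch below computes the mean squared latent distance over masked tokens and applies one EMA step. It uses plain NumPy, which has no automatic differentiation, so the stop-gradient is only conceptual here: the target array is treated as a constant. The names `latent_loss` and `ema_update` and the toy values are assumptions for illustration.

```python
import numpy as np


def latent_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean over M masked tokens of the squared L2 distance.

    pred, target: arrays of shape (M, D). The target plays the role of
    sg[target encoder output] and is treated as a constant.
    """
    diff = pred - target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def ema_update(target_params: dict, online_params: dict, tau: float) -> dict:
    """Return new target parameters: tau * target + (1 - tau) * online."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }


# Toy example: M = 2 masked tokens, D = 2 latent dimensions.
pred = np.array([[1.0, 0.0], [0.0, 2.0]])
target = np.zeros((2, 2))
print(latent_loss(pred, target))  # (1 + 4) / 2 = 2.5

# One EMA step with tau = 0.99 moves the target 1% of the way to the online weights.
target_params = {"w": np.zeros(3)}
online_params = {"w": np.ones(3)}
target_params = ema_update(target_params, online_params, tau=0.99)
print(target_params["w"])
```

With $\tau$ close to 1 the target encoder changes slowly, which is what lets it serve as a stable regression target for the predictor.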
Key results & what's novel
AnySat's novelty is unifying EO heterogeneity under one joint-embedding model via scale-adaptive, modality-agnostic tokenization. A single network transfers to classification, semantic segmentation, and time-series tasks under varied sensor configurations without training separate backbones. By replacing generative reconstruction with abstract predictive alignment, it sidesteps the difficulty of decoding noisy, multi-band, multi-resolution signals — exactly where JEPA's latent objective is strongest. It advances the field toward a genuine remote-sensing foundation model rather than a collection of sensor-specific encoders.
Strengths & limitations
- + One model spans modalities, resolutions, and scales; broadly reusable backbone.
- + Latent supervision avoids per-modality reconstruction decoders.
- + Transfers across diverse downstream EO tasks.
- − Scale-adaptive tokenization adds design complexity and assumptions about ground-sampling distance.
- − Joint training across very different sensors can dilute modality-specific detail.
- − Like all JEPAs, it is a representation learner, not a generative or action-conditioned world model.