Predicting Gradient is Better: SSL for SAR ATR with a JEPA

At a glance

Problem: SAR automatic target recognition (ATR) has scarce labels and heavy multiplicative speckle, so pixel-reconstruction SSL learns noise rather than target structure.

Key idea: Apply the JEPA template but predict gradient-domain features in latent space, emphasizing edges and scattering-center geometry over speckle.

Modality: Synthetic-aperture radar (SAR) image chips.

Target / masking: Masked blocks whose prediction targets are derived from spatial-gradient features rather than raw intensity.

Builds on: I-JEPA's masked latent prediction with an EMA target encoder.

Used for: Few-shot and label-efficient SAR-ATR transfer.

Motivation

Synthetic-aperture-radar imagery is coherent and dominated by multiplicative speckle, a grainy noise that has no semantic meaning yet carries large intensity variation. Automatic target recognition (ATR) must classify vehicles and structures from such chips, but annotated SAR data is scarce. Pixel-reconstruction self-supervision (e.g., MAE) wastes capacity fitting speckle and illumination artifacts, learning features that are not discriminative for target shape. The authors observe that what distinguishes targets is their scattering-center geometry — edges and contours of strong returns — which is far more stable than raw amplitude. The goal is a self-supervised objective that focuses on this geometry while ignoring speckle.

How it works

Canonical JEPA schematic for SAR imagery. The input is split into a visible context and hidden targets (patch-level, multi-block). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance.

The method follows the joint-embedding predictive architecture on SAR chips split into patches.

1. A context encoder $f_\theta$ embeds a masked (partial) view of the chip into context tokens.
2. An EMA target encoder $\bar f_\theta$ embeds the full chip; embeddings at held-out target-block positions form the prediction targets (stop-gradient).
3. A predictor $g_\phi$ maps context tokens plus positional mask tokens to the target representations.
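
The split into visible context and held-out target blocks can be sketched as follows. This is a minimal illustration, not the paper's masking code: the function name `sample_block_masks`, the 8×8 patch grid, the four 3×3 target blocks, and the rule "context = every patch outside all target blocks" are assumptions chosen for brevity (I-JEPA itself samples a separate context block and then removes the target regions from it).

```python
import numpy as np

def sample_block_masks(grid=8, num_targets=4, block=3, seed=0):
    """Sample square target blocks on a grid x grid patch layout.

    Returns (context_idx, target_blocks): the flat patch indices visible to
    the context encoder, and a list of index arrays, one per target block.
    All sizes are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    ids = np.arange(grid * grid).reshape(grid, grid)
    hidden = np.zeros((grid, grid), dtype=bool)
    target_blocks = []
    for _ in range(num_targets):
        # Top-left corner chosen so the block fits inside the grid.
        top, left = rng.integers(0, grid - block + 1, size=2)
        target_blocks.append(ids[top:top + block, left:left + block].ravel())
        hidden[top:top + block, left:left + block] = True
    context_idx = ids[~hidden]  # patches not covered by any target block
    return context_idx, target_blocks

context_idx, target_blocks = sample_block_masks()
```

Target blocks may overlap each other in this sketch, but the context never overlaps a target, so the predictor cannot copy target content from its input.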

The defining choice is what is predicted: targets are computed in a gradient domain, derived from spatial gradients of the SAR signal rather than its raw intensity. Because gradients are far less sensitive to speckle and illumination than raw amplitudes, latent prediction of gradient features steers the encoder toward the edge and backscatter-contour structure that defines targets.
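
One plausible form of a gradient-domain transform is sketched below. This summary does not specify the paper's gradient operator, so the log-amplitude step, the central-difference operator (`np.gradient`), and the function name `gradient_features` are all assumptions made for illustration. The log step is a common way to turn a multiplicative factor into an additive offset, which differencing then cancels.

```python
import numpy as np

def gradient_features(chip, eps=1e-6):
    """Hypothetical gradient-domain transform for a SAR amplitude chip.

    Assumption: log-amplitude followed by central differences. The paper's
    actual operator may differ (for example, a ratio-based edge detector).
    """
    chip = np.asarray(chip, dtype=np.float64)
    log_amp = np.log(chip + eps)          # multiplicative gain -> additive offset
    gy, gx = np.gradient(log_amp)         # derivatives along rows (y), columns (x)
    magnitude = np.hypot(gx, gy)          # edge strength
    return np.stack([gx, gy, magnitude])  # shape (3, H, W)

# A bright square on a dim background, observed at two overall gains.
chip = np.full((8, 8), 0.1)
chip[2:6, 2:6] = 1.0
feats = gradient_features(chip)
feats_scaled = gradient_features(5.0 * chip)
```

Here `feats` and `feats_scaled` agree up to the small effect of `eps`, which illustrates invariance to a global gain or illumination change. Random per-pixel speckle is not removed by this simple operator; the sketch only shows why a gradient-style target is insensitive to smooth multiplicative variation.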

The objective

Training minimizes the latent $\ell_2$ distance between predicted and target representations over masked blocks $k$:

$$\mathcal{L} = \frac{1}{M}\sum_k \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(\nabla x)_k\big]\,\big\rVert_2^2$$

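where $\nabla x$ denotes the gradient-domain transform applied before target encoding, $\operatorname{sg}$ is stop-gradient, $z_{\text{ctx}}$ denotes the context tokens produced by $f_\theta$, $m_k$ is the mask token for block $k$, and $M$ is the number of masked target blocks. The target encoder's parameters $\bar\theta$ track the context encoder's parameters $\theta$ by EMA with momentum $\tau$: $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. No pixels are decoded and no contrastive negatives are required.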
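
The loss and the EMA update above can be written out numerically as in the following sketch. The function names `jepa_latent_loss` and `ema_update`, the dictionary-of-arrays parameter layout, and the default momentum `tau=0.996` are assumptions for illustration; the paper's actual values and implementation are not given here.

```python
import numpy as np

def jepa_latent_loss(pred, target):
    """Mean over M target blocks of the squared L2 distance.

    pred, target: arrays of shape (M, D). `target` plays the role of the
    stop-gradient term; plain NumPy has no autograd, so it is just a constant.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

def ema_update(target_params, online_params, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, per named tensor."""
    return {name: tau * target_params[name] + (1.0 - tau) * online_params[name]
            for name in target_params}

# Two target blocks (M = 2) with 2-dimensional embeddings.
pred = np.array([[1.0, 2.0], [0.0, 0.0]])
target = np.array([[1.0, 0.0], [3.0, 4.0]])
loss = jepa_latent_loss(pred, target)  # (4 + 25) / 2

online = {"w": np.array([1.0, 1.0])}
ema = {"w": np.array([0.0, 0.0])}
ema = ema_update(ema, online, tau=0.9)
```

In an autograd framework the target would additionally be detached from the graph, so that only the context encoder and predictor receive gradients while the target encoder changes solely through the EMA step.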

Key results & what's novel

The paper's slogan — predicting gradient is better — captures its contribution: rather than inventing a new architecture, it changes the prediction target to suit sensor physics. By regressing gradient-domain features in embedding space, the encoder learns representations attuned to scattering geometry and resistant to speckle, yielding strong label-efficient and few-shot SAR-ATR transfer without any pixel decoder. It is an early demonstration that the JEPA principle of choosing abstract targets can be specialized per modality, and one of the first applications of joint-embedding prediction to a remote-sensing sensor.

Strengths & limitations

+ Gradient targets suppress speckle, improving robustness on a noisy coherent modality.
+ Label-efficient; no pixel reconstruction or contrastive negatives.
+ Simple, physics-motivated modification of the standard JEPA recipe.
− The gradient transform is a hand-chosen prior that may discard amplitude cues useful for some targets.
− Evaluated within SAR-ATR; generality to other sensors is not established.
− Masking and block-scale choices still require tuning for SAR chip sizes.

Connections & references

Builds on: I-JEPA

Related: AnySat, REJEPA, X-JEPA
