T-JEPA (Trajectory Similarity)

At a glance

Problem: Trajectory similarity measures (DTW, Hausdorff, Fréchet) are costly and heuristic, and learned methods often need labeled similarity supervision.

Key idea: Self-supervised trajectory embeddings via latent prediction, so embedding distance approximates trajectory similarity without labels.

Modality: Spatiotemporal trajectories (mobility data).

Target / masking: Sub-trajectory spans; targets are EMA latents of masked spans.

Builds on: I-JEPA / temporal JEPA latent masked prediction.

Used for: Trajectory similarity search, clustering, retrieval.

Motivation

Trajectory similarity underlies many spatiotemporal applications, such as route clustering, nearest-route retrieval, and anomaly detection in movement. Classical measures such as DTW, Hausdorff, and Fréchet distance are computationally costly (often quadratic in trajectory length) and heuristic in what they treat as "similar". Learned alternatives can be faster but typically require labeled similarity supervision, which is expensive and subjective. T-JEPA learns trajectory representations whose embedding distances approximate similarity, without any labels, replacing expensive alignment computations with cheap vector distances.

How it works

Figure: Canonical JEPA schematic for time series. The input is split into a visible context and hidden targets (window-level spans). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimises the latent distance.

A trajectory — a sequence of spatial points or segments — is tokenized, with sub-trajectory spans as the masking unit.
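The tokenization and span masking described here can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the grid-cell tokenizer, the function names, and the defaults for `cell_size`, `span_len`, and `mask_ratio` are all assumptions chosen for readability.

```python
import math
import random

def tokenize(points, cell_size=0.01):
    """Map (lat, lon) points to grid-cell tokens (hypothetical tokenizer)."""
    return [(math.floor(lat / cell_size), math.floor(lon / cell_size))
            for lat, lon in points]

def sample_span_mask(n_tokens, span_len=3, mask_ratio=0.5, rng=None):
    """Pick non-overlapping masked spans as (start, end) pairs, end exclusive."""
    if n_tokens < span_len:
        raise ValueError("trajectory is shorter than one span")
    rng = rng or random.Random()
    # Candidate spans are aligned to multiples of span_len, so they never overlap.
    starts = range(0, n_tokens - span_len + 1, span_len)
    n_spans = max(1, int(n_tokens * mask_ratio) // span_len)
    n_spans = min(n_spans, len(starts))
    chosen = sorted(rng.sample(starts, n_spans))
    return [(s, s + span_len) for s in chosen]

def split_context_targets(tokens, spans):
    """Separate visible (index, token) pairs from the masked target spans."""
    masked = {i for s, e in spans for i in range(s, e)}
    context = [(i, t) for i, t in enumerate(tokens) if i not in masked]
    targets = [[(i, tokens[i]) for i in range(s, e)] for s, e in spans]
    return context, targets
```

Positions are kept alongside tokens so that a predictor can be told which span it is being asked to fill in. Span length and masking ratio are the two knobs that would need tuning to the statistics of a given trajectory dataset.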

A context encoder $f_\theta$ embeds the visible portions of the trajectory.
An EMA target encoder $\bar f_\theta$ produces the latents of masked sub-trajectory targets, with gradients stopped.
A predictor $g_\phi$ regresses those target latents from the visible context.

There is no coordinate reconstruction and no need for ground-truth similarity labels. After pretraining, the encoder maps any trajectory to a fixed-length vector, so similarity search reduces to distance computation in the learned embedding space.
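To make the retrieval step concrete, here is a small sketch of nearest-neighbour search over fixed-length embeddings. The `embed` function is only a stand-in (it mean-pools coordinates) for the pretrained encoder, which is not reproduced here; `nearest` and the toy routes are likewise hypothetical.

```python
import math

def embed(trajectory):
    """Placeholder encoder: mean-pools points into a fixed-length 2-D vector.

    A stand-in for the pretrained encoder, used only to show the data flow.
    """
    n = len(trajectory)
    return (sum(p[0] for p in trajectory) / n,
            sum(p[1] for p in trajectory) / n)

def nearest(query_vec, index, k=1):
    """Return the k closest (distance, trajectory_id) pairs by Euclidean distance."""
    scored = sorted((math.dist(query_vec, vec), tid) for tid, vec in index.items())
    return scored[:k]

# Toy database: trajectories of different lengths map to same-size vectors.
database = {
    "route_a": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "route_b": [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)],
}
index = {tid: embed(traj) for tid, traj in database.items()}  # built once, offline

query = [(0.0, 0.2), (1.0, 0.2), (2.0, 0.2), (3.0, 0.2)]
best_distance, best_id = nearest(embed(query), index, k=1)[0]
```

Each comparison costs time proportional to the embedding dimension rather than to the product of the two trajectory lengths, which is where the speed-up over alignment-based measures comes from.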

The objective

For masked sub-trajectory spans $k=1\dots M$, the loss is the latent $\ell_2$ distance:

$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$

where $z_{\text{ctx}}$ is the context encoder's output on the visible tokens, $m_k$ marks the position of masked span $k$, $\bar f_\theta(x)_k$ is the target encoder's latent for that span, $\operatorname{sg}$ is the stop-gradient, and the target encoder is updated by EMA. Predicting masked spans in representation space forces the encoder to model the spatial and ordering regularities of movement, so geometric and route similarity becomes approximately metric in the learned space, with no alignment labels required.
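As a numerical illustration of this objective, the sketch below computes the mean squared $\ell_2$ distance over masked spans and applies one EMA step to the target parameters. It uses plain Python lists in place of tensors, so the stop-gradient is implicit (targets are just constants); the function names and the momentum value `tau=0.996` are assumptions, not values taken from the paper.

```python
def jepa_loss(predictions, targets):
    """Mean over masked spans of the squared L2 distance between latents.

    predictions[k] plays the role of g_phi(z_ctx, m_k); targets[k] plays the
    role of sg[f_bar(x)_k] and is treated as a constant.
    """
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need one prediction per masked span")
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total / len(targets)

def ema_update(target_params, online_params, tau=0.996):
    """Move target parameters toward the online ones: tau*target + (1-tau)*online."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

# Two masked spans with 2-D latents.
preds = [[1.0, 2.0], [0.0, 0.0]]
tgts = [[1.0, 0.0], [3.0, 4.0]]
loss = jepa_loss(preds, tgts)  # (4 + 25) / 2
```

In an autodiff framework the same loss would be backpropagated only through the context encoder and predictor, while the target encoder would change solely through the EMA step.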

Key results & what's novel

T-JEPA adapts JEPA's masked latent prediction to mobility data, providing label-free, efficient trajectory embeddings for similarity search and clustering. Because predicting masked sub-trajectory latents requires modeling how routes unfold spatially and in order, geometric and route similarity becomes approximately metric in embedding space, so costly alignment-based measures can be replaced with vector distances. The contribution broadens the JEPA family to spatiotemporal sequence domains and shows that self-supervised latent prediction can recover a usable similarity metric without any similarity supervision.

Strengths & limitations

+ Self-supervised: no labeled similarity pairs needed.
+ Fast similarity via distances between fixed-length vectors instead of quadratic alignment.
+ Captures spatial and ordering regularities of movement.
− Embedding distance only approximates classical measures; the induced metric may diverge from a target notion of similarity.
− Span-masking design (span length, ratio) needs tuning for trajectory statistics.
− Quality depends on tokenization and on how representative the pretraining trajectories are.

Connections & references

Builds on: I-JEPA
