Motivation
Trajectory similarity underlies many spatiotemporal applications, including route clustering, nearest-route retrieval, and anomaly detection in movement data. Classical measures such as DTW, Hausdorff, and Fréchet distance are computationally costly (often quadratic in trajectory length) and rest on hand-crafted notions of what counts as "similar". Learned alternatives can be faster but typically require labeled similarity supervision, which is expensive and subjective. This T-JEPA learns trajectory representations in a fully self-supervised way, so that embedding distances approximate similarity and expensive alignment computations are replaced by cheap vector distances.
How it works
A trajectory (a sequence of spatial points or segments) is tokenized, and sub-trajectory spans serve as the masking unit. Three components operate on the tokens:
- A context encoder $f_\theta$ embeds the visible portions of the trajectory.
- An EMA target encoder $\bar f_\theta$ produces the latents of masked sub-trajectory targets, with gradients stopped.
- A predictor $g_\phi$ regresses those target latents from the visible context.
There is no coordinate reconstruction and no need for ground-truth similarity labels. After pretraining, the encoder maps any trajectory to a fixed-length vector, so similarity search reduces to distance computation in the learned embedding space.
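As a minimal sketch of that last point, assume the pretrained encoder has already produced fixed-length embeddings. The toy vectors and the helper name `nearest_trajectories` are illustrative and not taken from the paper. Retrieval then amounts to ranking by Euclidean distance:

```python
import math


def nearest_trajectories(query_emb, corpus_embs, k=2):
    """Return the indices of the k corpus embeddings closest to the query.

    Uses plain Euclidean distance; each comparison costs O(d) in the
    embedding dimension, independent of the original trajectory lengths.
    """
    ranked = sorted(
        range(len(corpus_embs)),
        key=lambda i: math.dist(query_emb, corpus_embs[i]),
    )
    return ranked[:k]


# Made-up 3-dimensional embeddings standing in for encoder outputs.
corpus = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.2, 0.1],
]
query = [1.0, 0.0, 0.0]

top = nearest_trajectories(query, corpus, k=2)
print(top)  # [0, 2]
```

For large corpora the linear scan would normally be replaced by an approximate nearest-neighbor index, but the comparison itself stays a vector distance.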
The objective
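For masked sub-trajectory spans $k=1\dots M$, the loss is the mean squared $\ell_2$ distance between predicted and target latents:
$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$
where $z_{\text{ctx}}$ is the context encoder's output on the visible tokens, $m_k$ tells the predictor which masked span to predict (for example, a mask token carrying the span's position), $\bar f_\theta(x)_k$ is the target encoder's latent for span $k$, and $\operatorname{sg}$ is the stop-gradient. The target encoder is updated by EMA rather than by backpropagation. Predicting masked spans in representation space forces the encoder to model the spatial and ordering regularities of movement, so geometric and route similarity becomes approximately metric in the learned space, with no alignment labels required.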
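The objective and the EMA update can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: latents and parameters are plain lists of floats, the function names and the decay value are hypothetical, and the stop-gradient appears only as a comment because no autodiff framework is involved.

```python
def jepa_latent_loss(predicted, targets):
    """Mean over masked spans of the squared L2 distance in latent space.

    predicted: list of M predictor outputs, one vector per masked span.
    targets:   list of M target-encoder latents for the same spans.
               In an autodiff framework these would be wrapped in a
               stop-gradient (detached) before entering the loss.
    """
    if not predicted or len(predicted) != len(targets):
        raise ValueError("need the same non-zero number of spans")
    total = 0.0
    for p, t in zip(predicted, targets):
        total += sum((a - b) ** 2 for a, b in zip(p, t))
    return total / len(predicted)


def ema_update(target_params, online_params, decay=0.99):
    """Move target-encoder parameters toward the context encoder's.

    target <- decay * target + (1 - decay) * online
    The decay value is an assumed hyperparameter, not one from the source.
    """
    return [
        decay * t + (1.0 - decay) * o
        for t, o in zip(target_params, online_params)
    ]


# Two masked spans (M = 2) with 2-dimensional latents.
predicted = [[1.0, 0.0], [0.0, 2.0]]
targets = [[0.0, 0.0], [0.0, 0.0]]
loss = jepa_latent_loss(predicted, targets)
print(loss)  # 2.5, i.e. (1 + 4) / 2

# One EMA step on a toy 2-parameter "encoder"; result is close to [0.9, 0.1].
new_target = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
```

Only the context encoder and predictor would receive gradients from `loss`; the target encoder changes solely through `ema_update`.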
Key results & what's novel
This T-JEPA adapts JEPA's masked latent prediction to mobility data, yielding label-free, efficient trajectory embeddings for similarity search and clustering. Because predicting masked sub-trajectory latents requires modeling how routes unfold in space and in order, embedding distances track geometric and route similarity, so costly alignment-based measures can be replaced by vector distances. The contribution extends the JEPA family to spatiotemporal sequence domains and shows that self-supervised latent prediction can recover a usable similarity metric without any similarity supervision.
Strengths & limitations
- + Self-supervised: no labeled similarity pairs needed.
- + Fast similarity via distances between fixed-length vectors instead of quadratic alignment.
- + Captures spatial and ordering regularities of movement.
- − Embedding distance only approximates classical measures; the induced metric may diverge from a target notion of similarity.
- − Span-masking design (span length, masking ratio) needs tuning for trajectory statistics.
- − Quality depends on tokenization and on how representative the pretraining trajectories are.
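To make the span-masking knobs in the list above concrete, here is a hedged sketch of one possible sampler. The block-aligned scheme, the function names, and the parameter values are assumptions chosen for simplicity; the source does not specify how spans are drawn.

```python
import random


def sample_span_masks(num_tokens, span_len, mask_ratio, rng):
    """Sample non-overlapping spans covering roughly mask_ratio of tokens.

    The sequence is cut into aligned blocks of span_len tokens and a
    subset of blocks is masked, which rules out overlap by construction.
    Returns sorted half-open (start, end) index pairs.
    """
    num_blocks = num_tokens // span_len
    wanted = max(1, round(num_tokens * mask_ratio) // span_len)
    num_spans = min(wanted, num_blocks)
    chosen = sorted(rng.sample(range(num_blocks), num_spans))
    return [(b * span_len, (b + 1) * span_len) for b in chosen]


def split_visible_masked(num_tokens, spans):
    """Split token indices into visible context and masked targets."""
    masked = {i for start, end in spans for i in range(start, end)}
    visible = [i for i in range(num_tokens) if i not in masked]
    return visible, sorted(masked)


rng = random.Random(0)  # seeded only so the example is repeatable
spans = sample_span_masks(num_tokens=20, span_len=4, mask_ratio=0.4, rng=rng)
visible, masked = split_visible_masked(20, spans)
print(len(spans), len(visible), len(masked))  # 2 12 8
```

Longer spans make the prediction task depend more on route-level structure and less on local interpolation, which is one reason the span length and ratio would need to be matched to the sampling rate and typical length of the trajectories.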