T-JEPA: Trajectory Joint-Embedding Predictive Architecture
1. Introduction
Trajectory similarity computation is a foundational task in spatial data mining, underpinning applications from ride-sharing dispatch and fleet management to urban mobility analysis and location-based recommendation. Given two GPS trajectories—ordered sequences of latitude-longitude points recorded over time—the goal is to produce a scalar similarity score that agrees with a ground-truth distance function such as Hausdorff distance, Fréchet distance, or Dynamic Time Warping (DTW). Classical algorithms compute these distances exactly but suffer from quadratic or super-quadratic complexity in trajectory length, rendering them impractical for real-time retrieval over millions of trajectories.
Learning-based approaches address the computational bottleneck by encoding each trajectory into a fixed-dimensional representation vector, then computing similarity via a simple distance (e.g., Euclidean) in the learned embedding space. Once representations are precomputed, retrieval reduces to nearest-neighbor search and scales sub-linearly via indexing structures. However, prior learned methods face two persistent challenges:
- Dependence on hand-crafted augmentations. Contrastive learning methods such as TrajCL require carefully engineered trajectory augmentation strategies—point dropping, detour injection, distortion—to generate positive pairs. These augmentations introduce strong inductive biases about what constitutes trajectory similarity and may not generalize across cities, sampling rates, or GPS noise profiles.
- Vulnerability to GPS noise. Real-world GPS recordings exhibit systematic errors: signal drift in urban canyons, sampling-rate irregularity, and device-dependent positioning jitter. Methods operating directly on raw coordinates inherit this noise, while grid-based discretization methods lose fine-grained spatial information.
T-JEPA (Trajectory JEPA) addresses both challenges by adapting the Joint-Embedding Predictive Architecture (JEPA) framework—originally developed for images (I-JEPA, Assran et al., 2023)—to the trajectory domain. The core insight is that a model forced to predict the representations of masked trajectory segments from the remaining context must learn spatially and sequentially coherent embeddings, without requiring any manually designed augmentations. This self-supervised objective naturally captures the structure that underlies trajectory similarity.
Concretely, T-JEPA makes the following contributions:
- Augmentation-free self-supervised learning for trajectories. By predicting missing segments in representation space rather than reconstructing raw GPS coordinates, T-JEPA eliminates the need for contrastive augmentation pipelines while avoiding the pitfall of learning low-level noise patterns.
- AdjFuse module for GPS noise robustness. A novel spatial aggregation layer that fuses information from adjacent grid cells into each cell's representation, providing built-in tolerance to GPS positioning errors.
- Cell representation via node2vec. Spatial cells are represented not by arbitrary indices but by embeddings learned from a grid adjacency graph using node2vec, capturing topological relationships between locations.
- Automatic resampling-based masking. A successive-sampling-probability masking strategy that naturally adapts to variable-length trajectories without manual tuning of mask ratios.
How T-JEPA differs from I-JEPA
While T-JEPA inherits the predict-in-latent-space philosophy from I-JEPA, the adaptation is non-trivial. I-JEPA operates on 2D image patches arranged in a regular grid; T-JEPA operates on 1D sequences of GPS cells with irregular spatial distributions, variable lengths, and inherent measurement noise. The masking strategy shifts from contiguous spatial blocks (I-JEPA) to random sampling along the trajectory sequence. The input representation pipeline is entirely new: raw GPS coordinates undergo spatial discretization, node2vec embedding, and AdjFuse aggregation before entering the Transformer encoder. The predictor must handle sequential rather than spatial prediction targets. These differences make T-JEPA a substantive architectural contribution, not merely a domain transfer.
2. Method
T-JEPA's method proceeds in three conceptual phases:
Phase 1: Spatial Grounding. Raw GPS trajectories are noisy sequences of floating-point coordinates. T-JEPA first discretizes the continuous geographic space into a uniform grid of cells. Each cell receives a learned embedding derived from the grid's spatial adjacency graph via node2vec. This converts a raw trajectory into a sequence of semantically meaningful cell tokens. The AdjFuse module then enriches each cell token by aggregating information from its spatial neighbors, providing tolerance to GPS drift that might map a point to a slightly wrong cell.
Phase 2: Masking and Encoding. After spatial grounding, T-JEPA randomly selects a subset of cells from the trajectory as targets (to be predicted) and feeds the remaining cells as context to a Transformer encoder. A separate target encoder—an exponential moving average (EMA) copy of the context encoder—processes the target cells to produce ground-truth representations. The masking is based on a successive sampling probability that naturally adapts to trajectory length.
Phase 3: Latent Prediction. A lightweight predictor network takes the context encoder's output representations along with positional information about the masked locations, and predicts the target representations. The loss is a smooth L1 distance between the predicted and actual target representations. By predicting in representation space rather than coordinate space, the model learns high-level trajectory semantics rather than low-level GPS noise patterns.
3. Model Overview
At-a-Glance
| Component | Details |
|---|---|
| Input | GPS trajectory → grid-cell sequence → node2vec embeddings → AdjFuse representations |
| Masking | Random sampling with successive sampling probability; adaptive to trajectory length |
| Context Encoder | Transformer encoder (trainable); processes unmasked context cells |
| Target Encoder | EMA copy of context encoder (frozen during gradient step); processes target cells |
| Predictor | Lightweight Transformer; maps context representations + positional queries → target representations |
| Loss | Smooth L1 (Huber) loss between predicted and target representations |
| Key Result | State-of-the-art trajectory similarity on Porto, T-Drive, GeoLife, Foursquare; robust to down-sampling and distortion |
| Parameters | Not explicitly reported; dominated by Transformer encoder dimensions |
Training Architecture Diagram
4. Main Components of T-JEPA
4.1 Cell Representation via node2vec
WHAT: Before any trajectory can enter the JEPA framework, raw GPS coordinates must be converted into a discrete, learnable representation. T-JEPA partitions the geographic bounding box of the dataset into a uniform grid of cells. Each cell is a small rectangular region of the map. An adjacency graph $G = (V, E)$ is constructed where each cell is a node and edges connect spatially adjacent cells (including diagonal adjacency, yielding up to 8 neighbors per cell). The node2vec algorithm is then applied to this graph to learn a $D$-dimensional embedding for each cell.
HOW: The geographic space is divided into an $H \times W$ grid based on the bounding box of all trajectories in the dataset. For the Porto dataset, this produces on the order of thousands of cells. node2vec performs biased random walks on the adjacency graph and trains a skip-gram model to produce embeddings that capture both local neighborhood structure (BFS-like) and longer-range structural equivalence (DFS-like). The walk parameters $p$ (return parameter) and $q$ (in-out parameter) control this trade-off. Each GPS point $(lat_i, lon_i)$ is mapped to its containing cell $c_i$, and the trajectory becomes a sequence of cell embeddings $\{e_{c_1}, e_{c_2}, \ldots, e_{c_n}\}$ where $e_{c_i} \in \mathbb{R}^D$.
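T-JEPA's code is not publicly released; the sketch below illustrates the grid construction and cell-embedding step described above, assuming `networkx` and the open-source `node2vec` package (whose API is assumed here, and which any skip-gram node embedder could replace). The bounding box and grid size are illustrative; the walk parameters follow the settings listed in Section 5.

```python
import numpy as np
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec (API assumed)

# Illustrative bounding box and grid resolution -- real values are dataset-specific.
LAT_MIN, LAT_MAX = 41.10, 41.25
LON_MIN, LON_MAX = -8.70, -8.50
H, W = 50, 50  # H x W grid of cells

def point_to_cell(lat: float, lon: float) -> tuple:
    """Map a GPS point to its (row, col) grid cell."""
    row = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * H), H - 1)
    col = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * W), W - 1)
    return row, col

# Adjacency graph G = (V, E): nodes are cells, edges connect 8-neighborhoods (incl. diagonals).
G = nx.Graph()
G.add_nodes_from((r, c) for r in range(H) for c in range(W))
for r in range(H):
    for c in range(W):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) != (0, 0) and 0 <= r + dr < H and 0 <= c + dc < W:
                    G.add_edge((r, c), (r + dr, c + dc))

# node2vec with the Section 5 settings: D=256, walk length 80, 10 walks per node, p=q=1.
n2v = Node2Vec(G, dimensions=256, walk_length=80, num_walks=10, p=1.0, q=1.0, workers=4)
model = n2v.fit(window=10, min_count=1)

# A trajectory becomes a sequence of cell embeddings {e_{c_1}, ..., e_{c_n}}.
traj = [(41.15, -8.61), (41.151, -8.609), (41.152, -8.607)]
cells = [point_to_cell(lat, lon) for lat, lon in traj]
cell_embeddings = np.stack([model.wv[str(c)] for c in cells])  # shape (n, 256)
```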
WHY: Using node2vec embeddings rather than one-hot cell indices or raw coordinates provides several advantages: (1) nearby cells have similar embeddings, encoding spatial proximity directly into the representation; (2) cells with similar connectivity patterns (e.g., highway intersections vs. residential streets) receive similar embeddings; (3) the embedding dimensionality is fixed regardless of grid resolution. Ablation results in the paper show that removing node2vec embeddings and using simple cell indices significantly degrades similarity computation performance, confirming that the topological information captured by node2vec is essential.
4.2 AdjFuse Module
WHAT: AdjFuse is a spatial aggregation module that enriches each cell's embedding by incorporating information from its adjacent cells in the grid graph. This is T-JEPA's primary mechanism for handling GPS noise.
HOW: For a cell $c_i$ with embedding $e_{c_i}$ and adjacent cells $\mathcal{N}(c_i)$ in the grid graph, AdjFuse computes:
$$e'_{c_i} = \text{AdjFuse}(e_{c_i}, \{e_{c_j}\}_{c_j \in \mathcal{N}(c_i)})$$

The aggregation uses an attention-weighted combination. First, attention weights are computed between the center cell and each neighbor:

$$\alpha_{ij} = \frac{\exp(e_{c_i}^\top W_a \, e_{c_j})}{\sum_{c_k \in \mathcal{N}(c_i)} \exp(e_{c_i}^\top W_a \, e_{c_k})}$$

where $W_a \in \mathbb{R}^{D \times D}$ is a learnable attention weight matrix. The fused representation combines the original cell embedding with the aggregated neighbor information:

$$e'_{c_i} = e_{c_i} + \sum_{c_j \in \mathcal{N}(c_i)} \alpha_{ij} \cdot W_v \, e_{c_j}$$

where $W_v \in \mathbb{R}^{D \times D}$ is a learnable value projection. This is effectively a single graph attention layer over the grid neighborhood.
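The following PyTorch sketch implements the two equations above as a single neighbor-attention layer. It is a minimal illustration, not the authors' implementation; padding masks for cells with fewer than eight neighbors are omitted for brevity.

```python
import torch
import torch.nn as nn

class AdjFuse(nn.Module):
    """Attention-weighted fusion of each cell embedding with its grid neighbors,
    following the two equations above (illustrative sketch, not released code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.W_a = nn.Linear(dim, dim, bias=False)  # attention weight matrix W_a
        self.W_v = nn.Linear(dim, dim, bias=False)  # value projection W_v

    def forward(self, center: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # center:    (n, D)    embedding of each cell in the trajectory
        # neighbors: (n, K, D) embeddings of up to K=8 adjacent cells (padding mask omitted)
        scores = torch.einsum('nd,nkd->nk', self.W_a(center), neighbors)   # e_i^T W_a e_j
        alpha = torch.softmax(scores, dim=-1)                              # attention over neighbors
        agg = torch.einsum('nk,nkd->nd', alpha, self.W_v(neighbors))       # sum_j alpha_ij W_v e_j
        return center + agg                                                # residual fusion e'_{c_i}

# Example: a 3-cell trajectory, 8 neighbors per cell, D = 256
fuse = AdjFuse(256)
fused = fuse(torch.randn(3, 256), torch.randn(3, 8, 256))  # -> (3, 256)
```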
WHY: GPS errors of 5–15 meters are common in urban environments, frequently causing a GPS point to be assigned to a neighboring cell rather than the correct one. AdjFuse mitigates this by ensuring that each cell's representation already incorporates information from its neighbors, so a one-cell positioning error has minimal impact on the downstream representation. The paper's ablation study shows that removing AdjFuse leads to notable degradation in similarity accuracy, particularly under high distortion settings where GPS noise is amplified.
4.3 Context Encoder
WHAT: The context encoder $f_\theta$ is a standard Transformer encoder that processes the visible (unmasked) subset of the trajectory's cell tokens and produces contextualized representations.
HOW: Given a trajectory of $N$ cells after AdjFuse processing, the masking strategy selects $N_c$ cells as context (visible) and $N_t$ cells as targets (masked), where $N_c + N_t = N$. The context encoder receives the $N_c$ visible cell embeddings along with positional encodings that encode each cell's position within the original trajectory sequence:
$$h_1, h_2, \ldots, h_{N_c} = f_\theta(e'_{c_{i_1}} + \text{PE}(i_1), \ldots, e'_{c_{i_{N_c}}} + \text{PE}(i_{N_c}))$$

where $\{i_1, \ldots, i_{N_c}\}$ are the indices of visible cells in the original trajectory, $\text{PE}(\cdot)$ denotes positional encoding, and $h_k \in \mathbb{R}^D$. The Transformer encoder uses multi-head self-attention and feed-forward layers with standard pre-norm or post-norm configuration. The positional encoding preserves information about where each visible cell falls within the full trajectory, which is critical for the predictor to know where the missing cells are.
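A minimal PyTorch sketch of the context encoder as described: a standard Transformer encoder applied to the visible tokens, with sinusoidal positional encodings indexed by each cell's position in the original trajectory. Layer counts and dimensions follow Section 5; the class name and interface are illustrative.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(max_len: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional-encoding table of shape (max_len, dim)."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-math.log(10000.0) / dim))
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class ContextEncoder(nn.Module):
    """Transformer encoder over the visible cell tokens (Section 5 settings)."""

    def __init__(self, dim=256, depth=6, heads=8, ff=1024, dropout=0.1, max_len=1024):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=ff,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.register_buffer('pos', sinusoidal_pe(max_len, dim))

    def forward(self, tokens: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, N_c, D) AdjFuse outputs for the visible cells
        # positions: (B, N_c)    indices of those cells in the original trajectory
        x = tokens + self.pos[positions]   # PE uses the original sequence position
        return self.encoder(x)             # (B, N_c, D) contextualized representations h_k
```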
WHY: The Transformer architecture is well-suited to variable-length sequences with permutation-sensitive structure. Unlike RNN-based trajectory encoders (e.g., t2vec), the Transformer processes all context cells in parallel and can capture long-range dependencies within the trajectory. Processing only the visible cells (rather than the full sequence with mask tokens) is computationally efficient and prevents the encoder from learning trivial shortcuts based on mask-token positions—a design principle inherited from I-JEPA.
4.4 Target Encoder (EMA)
WHAT: The target encoder $f_{\bar{\theta}}$ produces the ground-truth representations that the predictor must match. It has identical architecture to the context encoder but its parameters are not updated by gradient descent.
HOW: The target encoder processes the target (masked) cells to produce target representations:
$$\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_{N_t} = f_{\bar{\theta}}(e'_{c_{j_1}} + \text{PE}(j_1), \ldots, e'_{c_{j_{N_t}}} + \text{PE}(j_{N_t}))$$

where $\{j_1, \ldots, j_{N_t}\}$ are the indices of masked cells. After each training step, the target encoder parameters are updated via exponential moving average:

$$\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau) \theta$$

where $\tau \in [0, 1)$ is the EMA decay coefficient. Following I-JEPA convention, $\tau$ is typically scheduled from a lower value (e.g., 0.996) to a higher value (e.g., 0.999 or 1.0) over the course of training using a cosine schedule, providing more aggressive updates early and more stable targets later.
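The EMA update and cosine schedule can be sketched as follows (PyTorch). The schedule endpoints match Section 5; the exact schedule shape is an assumption consistent with I-JEPA practice.

```python
import math
import torch

@torch.no_grad()
def ema_update(target_encoder: torch.nn.Module, context_encoder: torch.nn.Module, tau: float):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, applied after every optimizer step."""
    for p_bar, p in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_bar.mul_(tau).add_(p, alpha=1.0 - tau)

def tau_schedule(step: int, total_steps: int, tau_start: float = 0.996, tau_end: float = 0.999) -> float:
    """Cosine ramp of the EMA coefficient from tau_start to tau_end over training."""
    progress = step / max(total_steps, 1)
    return tau_end - (tau_end - tau_start) * 0.5 * (1.0 + math.cos(math.pi * progress))

# The target encoder starts as an exact copy of the context encoder (copy.deepcopy)
# and never receives gradients (p.requires_grad_(False) for all of its parameters).
```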
WHY: The EMA target encoder is central to collapse prevention in the JEPA framework. If both encoders were updated by gradient descent, the system could trivially minimize loss by mapping all inputs to the same constant representation (representation collapse). The EMA update provides a slowly-evolving target that is always slightly "behind" the online encoder, creating a bootstrapping dynamic similar to BYOL. The stop-gradient through the target encoder means gradients only flow through the context encoder and predictor, establishing the asymmetry necessary for stable self-supervised learning. It is worth noting that EMA alone does not provide formal collapse guarantees; the combination of EMA, predictor bottleneck, and the masking-based prediction task collectively discourage degenerate solutions, as demonstrated empirically rather than proven theoretically.
4.5 Predictor
WHAT: The predictor $g_\phi$ is a lightweight network that takes the context encoder's output and predicts the target representations at the masked positions.
HOW: The predictor receives the context representations $\{h_1, \ldots, h_{N_c}\}$ and must produce predictions $\{\tilde{s}_1, \ldots, \tilde{s}_{N_t}\}$ for each target position. The predictor uses learnable mask tokens $m \in \mathbb{R}^D$ positioned at the target indices:
$$\tilde{s}_1, \ldots, \tilde{s}_{N_t} = g_\phi([h_1, \ldots, h_{N_c}, m + \text{PE}(j_1), \ldots, m + \text{PE}(j_{N_t})])$$

The predictor is a narrow Transformer with fewer layers and/or smaller hidden dimension than the main encoder. It processes the concatenation of context representations and positionally-encoded mask tokens, using self-attention to route information from context positions to target positions. The output at target positions constitutes the predictions.
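A minimal sketch of the predictor: learnable mask tokens are summed with the positional encodings of the target indices, concatenated with the context representations, and processed by a narrow Transformer. The interface mirrors the illustrative `ContextEncoder` sketch above and is not the authors' API.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Narrow Transformer predictor: mask-token queries at the target positions attend
    to the context representations (illustrative sketch)."""

    def __init__(self, dim=256, depth=2, heads=8, ff=512, dropout=0.1):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=ff,
                                           dropout=dropout, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, context_repr: torch.Tensor, target_pe: torch.Tensor) -> torch.Tensor:
        # context_repr: (B, N_c, D) outputs h_1..h_{N_c} of the context encoder
        # target_pe:    (B, N_t, D) positional encodings PE(j_k) of the masked positions
        queries = self.mask_token + target_pe            # m + PE(j_k)
        x = torch.cat([context_repr, queries], dim=1)    # [h_1..h_{N_c}, m+PE(j_1)..m+PE(j_{N_t})]
        out = self.net(x)
        return out[:, context_repr.size(1):, :]          # predictions at the target positions
```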
WHY: The predictor serves two purposes: (1) it performs the actual prediction task, mapping context information to target representations conditioned on positional queries; (2) its limited capacity creates an information bottleneck that prevents the system from learning trivial identity mappings. If the predictor were as powerful as the encoder, it could potentially memorize input-output mappings without the encoder learning useful representations. The narrow predictor forces the encoder to produce representations that are informative enough for a simple model to make accurate predictions, which is the key mechanism driving representation quality. This design principle is directly inherited from I-JEPA.
4.6 Masking Strategy
WHAT: T-JEPA uses a random sampling-based masking strategy with successive sampling probability that determines which trajectory cells are hidden (targets) and which remain visible (context).
HOW: Unlike I-JEPA's block masking (which masks contiguous rectangular regions in a 2D image grid), T-JEPA operates on 1D trajectory sequences and employs a sampling-based approach. Given a trajectory of $N$ cells, the masking procedure scans the sequence from the beginning and includes each cell in the target set according to a successive sampling probability: a base masking probability $p_{\text{mask}}$ modulated by whether preceding cells were masked, so that the resulting masks form variably sized contiguous or near-contiguous segments. This produces masks that are neither purely random (which would be too easy, since each missing cell could be interpolated from its immediate neighbors) nor purely contiguous (which might be too difficult for short trajectories).
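The paper describes the successive sampling probability only at a high level, so the sketch below is one plausible interpretation: the masking probability is temporarily boosted right after a cell has been masked, which yields a mix of isolated masked cells and short masked runs. The parameters `p_base` and `p_succ` are hypothetical, not reported values.

```python
import random

def successive_sampling_mask(n: int, p_base: float = 0.3, p_succ: float = 0.7):
    """Partition the indices 0..n-1 of a trajectory into (context, target) sets.

    Each cell is masked with base probability p_base; immediately after a masked
    cell the probability rises to p_succ, so masked cells tend to form short runs
    rather than isolated points (one plausible reading of the paper)."""
    context, target = [], []
    p = p_base
    for i in range(n):
        if random.random() < p:
            target.append(i)
            p = p_succ        # boost the chance of extending the masked segment
        else:
            context.append(i)
            p = p_base        # reset once the segment is broken
    if not target and context:   # guarantee at least one target (for n >= 2)
        target.append(context.pop())
    if not context and target:   # guarantee at least one context cell
        context.append(target.pop())
    return context, target

# Example: mask a 20-cell trajectory
ctx_idx, tgt_idx = successive_sampling_mask(20)
```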
WHY: The masking strategy is designed to create a pretext task of appropriate difficulty. Purely random masking (where each cell is independently masked) would allow the model to trivially interpolate from immediate neighbors. Purely contiguous masking (removing a single long segment) might be too difficult and would not provide diverse training signal. The successive sampling approach naturally produces a mix of isolated masked cells and short contiguous masked segments, requiring the model to learn both local interpolation and longer-range trajectory structure. Critically, this approach requires no manual augmentation design—the masking itself generates the self-supervised signal, replacing the handcrafted augmentations (point dropping, detour injection) used in contrastive methods like TrajCL.
4.7 Loss Function
WHAT: T-JEPA uses the Smooth L1 loss (Huber loss) to measure the discrepancy between predicted and target representations.
HOW: Let $\tilde{s}_k \in \mathbb{R}^D$ denote the predictor's output for the $k$-th target position and $\hat{s}_k \in \mathbb{R}^D$ denote the corresponding target encoder output. The per-element Smooth L1 loss is defined as:
$$\text{SmoothL1}(x) = \begin{cases} \frac{1}{2\beta} x^2 & \text{if } |x| < \beta \\ |x| - \frac{\beta}{2} & \text{otherwise} \end{cases}$$

where $\beta > 0$ is a threshold parameter (typically $\beta = 1.0$). The total loss over a batch of $B$ trajectories is:

$$\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{N_t^{(b)}} \sum_{k=1}^{N_t^{(b)}} \frac{1}{D} \sum_{d=1}^{D} \text{SmoothL1}(\tilde{s}_{k,d}^{(b)} - \hat{s}_{k,d}^{(b)})$$

where:
- $B$ is the batch size
- $N_t^{(b)}$ is the number of target (masked) cells in the $b$-th trajectory
- $D$ is the representation dimensionality
- $\tilde{s}_{k,d}^{(b)}$ is the $d$-th dimension of the predicted representation for target position $k$ in trajectory $b$
- $\hat{s}_{k,d}^{(b)}$ is the corresponding target encoder representation (treated as a fixed target; no gradient flows through it)
- $\beta$ is the Smooth L1 threshold controlling the transition between quadratic and linear regimes
The gradient with respect to the context encoder parameters $\theta$ and predictor parameters $\phi$ is:
$$\nabla_{\theta, \phi} \mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{N_t^{(b)}} \sum_{k=1}^{N_t^{(b)}} \frac{1}{D} \sum_{d=1}^{D} \nabla_{\theta, \phi} \text{SmoothL1}\big(\tilde{s}_{k,d}^{(b)} - \text{sg}(\hat{s}_{k,d}^{(b)})\big)$$

where $\text{sg}(\cdot)$ denotes stop-gradient applied to the target encoder outputs.
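A minimal PyTorch sketch of the loss. With variable numbers of targets per trajectory, the element-wise mean below weights trajectories by their target count, a small simplification of the per-trajectory average in the equation above.

```python
import torch
import torch.nn.functional as F

def tjepa_loss(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth L1 loss between predicted and target representations.

    pred:   (B, N_t, D) predictor outputs (s-tilde)
    target: (B, N_t, D) target-encoder outputs (s-hat); detached so no gradient
            flows back through the EMA target encoder (stop-gradient)."""
    return F.smooth_l1_loss(pred, target.detach(), beta=beta, reduction='mean')
```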
WHY: Smooth L1 loss is chosen over MSE (L2) loss because it is less sensitive to outliers in representation space. The quadratic regime near zero provides smooth gradients for small errors, while the linear regime for large errors prevents gradient explosion from occasional large prediction errors. This is especially relevant for trajectory data where some masked segments may be inherently more difficult to predict (e.g., trajectory endpoints or segments passing through highly variable areas). The choice also differs from I-JEPA's standard L2 loss, reflecting the adaptation to the noisier trajectory domain.
5. Implementation Details
| Hyperparameter | Value / Setting | Notes |
|---|---|---|
| Cell embedding dimension ($D$) | 256 | node2vec output dimension |
| node2vec walk length | 80 | Random walk length on grid graph |
| node2vec num walks | 10 | Walks per node |
| node2vec window size | 10 | Skip-gram context window |
| node2vec $p$, $q$ | 1.0, 1.0 | Balanced BFS/DFS exploration |
| Encoder layers | 6 | Transformer encoder depth |
| Attention heads | 8 | Multi-head self-attention |
| Hidden dimension | 256 | Transformer model dimension |
| Feed-forward dimension | 1024 | 4× hidden dim (standard) |
| Predictor layers | 2–3 | Lightweight relative to encoder |
| Predictor dimension | 256 | Same as encoder output |
| Optimizer | AdamW | Weight decay regularization |
| Learning rate | 1e-3 to 1e-4 | With warm-up and cosine decay |
| Batch size | 64–128 | Per-GPU batch size |
| Training epochs | 100–200 | Dataset-dependent |
| EMA decay ($\tau$) | 0.996 → 0.999 | Cosine schedule over training |
| Smooth L1 $\beta$ | 1.0 | Standard threshold |
| Masking ratio | ~50% | Approximately half of cells masked |
| Positional encoding | Sinusoidal | Sequence position within trajectory |
| Dropout | 0.1 | Applied in encoder and predictor |
| Grid resolution | Dataset-specific | Determined by bounding box and target cell size |
Note: The T-JEPA paper does not release a public code repository. Some implementation details above are inferred from the paper's description and standard practices for Transformer-based trajectory models. Values that are directly reported in the paper are marked without qualification; inferred values are presented as reasonable defaults consistent with the paper's description and should be verified against any future code release.
6. Algorithm
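The paper's algorithm listings are not reproduced here. As a stand-in, the sketch below chains the illustrative components defined earlier (`AdjFuse`, `ContextEncoder`, `Predictor`, `successive_sampling_mask`, `tjepa_loss`, `ema_update`, `sinusoidal_pe`) into one training iteration for a single trajectory; it assumes those sketches are in scope and is not the authors' code.

```python
import torch

def train_step(traj_cells, neighbors, adjfuse, f_ctx, f_tgt, predictor, optimizer, tau):
    """One T-JEPA training iteration for a single trajectory (batch size 1 for clarity).

    traj_cells: (N, D) node2vec embeddings of the trajectory's cells
    neighbors:  (N, K, D) embeddings of each cell's grid neighbors"""
    x = adjfuse(traj_cells, neighbors)                     # Phase 1: noise-robust cell tokens
    n, d = x.shape
    ctx_idx, tgt_idx = successive_sampling_mask(n)         # Phase 2: context / target split
    ctx_pos = torch.tensor(ctx_idx).unsqueeze(0)           # (1, N_c)
    tgt_pos = torch.tensor(tgt_idx).unsqueeze(0)           # (1, N_t)

    h = f_ctx(x[ctx_idx].unsqueeze(0), ctx_pos)            # context representations (1, N_c, D)
    with torch.no_grad():
        s_hat = f_tgt(x[tgt_idx].unsqueeze(0), tgt_pos)    # EMA targets, no gradient (1, N_t, D)

    pe = sinusoidal_pe(n, d).to(x.device)
    s_tilde = predictor(h, pe[tgt_idx].unsqueeze(0))       # Phase 3: latent prediction (1, N_t, D)

    loss = tjepa_loss(s_tilde, s_hat)                      # Smooth L1 in representation space
    optimizer.zero_grad()
    loss.backward()                                        # gradients reach f_ctx and predictor only
    optimizer.step()
    ema_update(f_tgt, f_ctx, tau)                          # theta_bar <- tau*theta_bar + (1-tau)*theta
    return loss.item()
```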
7. Training
Step-by-Step: One Training Iteration
- Sample mini-batch. Draw $B$ trajectories from the training set. Each trajectory is a variable-length sequence of GPS points.
- Discretize to cells. Map each GPS point $(lat, lon)$ to its grid cell ID based on the precomputed grid partition. The trajectory becomes a cell sequence $\{c_1, \ldots, c_n\}$.
- Look up cell embeddings. Retrieve the precomputed node2vec embedding $e_{c_i} \in \mathbb{R}^D$ for each cell in the sequence.
- Apply AdjFuse. For each cell, aggregate embeddings from spatially adjacent cells via the attention-weighted AdjFuse module, producing noise-robust embeddings $e'_{c_i} \in \mathbb{R}^D$.
- Add positional encoding. Add sinusoidal positional encodings to preserve trajectory-order information: $x_i = e'_{c_i} + \text{PE}(i)$, where position $i$ reflects the cell's index within the trajectory.
- Generate mask. Apply the successive-sampling-based masking (Section 4.6) to partition indices into context set $\mathcal{C}$ and target set $\mathcal{M}$.
- Context encoder forward. Feed the context tokens $\{x_i\}_{i \in \mathcal{C}}$ through the Transformer context encoder $f_\theta$ to obtain contextualized representations $\{h_i\}_{i \in \mathcal{C}} \in \mathbb{R}^{|\mathcal{C}| \times D}$.
- Target encoder forward (no gradient). Feed the target tokens $\{x_i\}_{i \in \mathcal{M}}$ through the EMA target encoder $f_{\bar{\theta}}$ (with stop-gradient) to obtain target representations $\{\hat{s}_i\}_{i \in \mathcal{M}} \in \mathbb{R}^{|\mathcal{M}| \times D}$.
- Predictor forward. Feed context representations concatenated with positionally-encoded mask tokens into the predictor $g_\phi$. Extract outputs at target positions to obtain predictions $\{\tilde{s}_i\}_{i \in \mathcal{M}} \in \mathbb{R}^{|\mathcal{M}| \times D}$.
- Compute loss. Calculate Smooth L1 loss between $\tilde{s}_i$ and $\hat{s}_i$ for all target positions, averaged over targets and batch.
- Backpropagate. Compute gradients $\nabla_{\theta, \phi} \mathcal{L}$ with respect to context encoder and predictor parameters. Gradients do not flow through the target encoder.
- Parameter update. Update $\theta$ (context encoder) and $\phi$ (predictor) using AdamW optimizer with current learning rate from cosine schedule.
- EMA update. Update target encoder: $\bar{\theta} \leftarrow \tau(t) \bar{\theta} + (1 - \tau(t)) \theta$ with current EMA coefficient from cosine schedule.
Training Architecture: Gradient Flow
8. Inference
At inference time, T-JEPA discards the predictor $g_\phi$ and the target encoder $f_{\bar{\theta}}$. Only the trained context encoder $f_\theta$ and the AdjFuse module are retained. The masking mechanism is also removed: the full trajectory is processed without any masked positions.
Representation Extraction
For each trajectory $\tau$:
- Map all GPS points to grid cells: $\{c_1, \ldots, c_n\}$
- Look up node2vec embeddings: $\{e_{c_1}, \ldots, e_{c_n}\}$
- Apply AdjFuse: $\{e'_{c_1}, \ldots, e'_{c_n}\}$
- Add positional encodings and pass through full context encoder: $\{h_1, \ldots, h_n\} = f_\theta(\{e'_{c_i} + \text{PE}(i)\}_{i=1}^n)$
- Aggregate token representations into a single trajectory vector via mean pooling: $r = \frac{1}{n}\sum_{i=1}^{n} h_i \in \mathbb{R}^D$
Similarity Computation
Given trajectory representations $r_a$ and $r_b$, the similarity score is computed as the negative Euclidean distance:
$$\text{sim}(\tau_a, \tau_b) = -\|r_a - r_b\|_2$$

For $k$-nearest-neighbor retrieval, all database trajectory representations are precomputed offline. Query-time similarity computation reduces to a single encoder forward pass plus a vector distance computation, enabling sub-linear retrieval with approximate nearest-neighbor indexing (e.g., FAISS).
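A minimal sketch of the inference pipeline: full-trajectory encoding with mean pooling, negative-Euclidean similarity, and brute-force $k$-NN retrieval (an ANN index such as FAISS would replace the last step at scale). It reuses the illustrative `AdjFuse` and `ContextEncoder` sketches from Section 4, not the authors' code.

```python
import torch

@torch.no_grad()
def encode_trajectory(traj_cells, neighbors, adjfuse, f_ctx):
    """Full-trajectory embedding at inference: AdjFuse -> context encoder -> mean pooling.

    traj_cells: (N, D) node2vec embeddings; neighbors: (N, K, D) neighbor embeddings."""
    x = adjfuse(traj_cells, neighbors)               # (N, D)
    pos = torch.arange(x.size(0)).unsqueeze(0)       # full sequence, nothing masked
    h = f_ctx(x.unsqueeze(0), pos)                   # (1, N, D)
    return h.mean(dim=1).squeeze(0)                  # (D,) trajectory vector r

def similarity(r_a: torch.Tensor, r_b: torch.Tensor) -> torch.Tensor:
    """sim(tau_a, tau_b) = -||r_a - r_b||_2"""
    return -torch.linalg.vector_norm(r_a - r_b)

def knn(query: torch.Tensor, database: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Brute-force k-NN over a precomputed database of representations (M, D)."""
    dists = torch.cdist(query.unsqueeze(0), database).squeeze(0)   # (M,) Euclidean distances
    return torch.topk(dists, k, largest=False).indices             # indices of the k nearest
```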
Downstream Protocols
Direct similarity (zero-shot): Use the pretrained encoder representations directly for trajectory similarity, as described above. This is the primary evaluation mode in the paper.
Fine-tuning: For supervised trajectory similarity tasks where ground-truth distance labels are available, the encoder can be fine-tuned end-to-end with a regression head that predicts the target distance metric (Hausdorff, Fréchet, or DTW) from the concatenated or differenced representation pair.
Linear probe: For trajectory classification tasks (e.g., transportation mode detection), a linear classifier can be trained on top of frozen trajectory representations to evaluate representation quality without modifying the encoder.
Inference Pipeline Diagram
9. Results & Benchmarks
Datasets
| Dataset | City | Trajectories | Avg Length | Type |
|---|---|---|---|---|
| Porto | Porto, Portugal | ~1.7M | ~50 points | Taxi GPS traces |
| T-Drive | Beijing, China | ~10K | Variable | Taxi GPS traces |
| GeoLife | Beijing, China | ~17K | Variable | Personal GPS logs (walk, drive, bus) |
| Foursquare | New York City | ~200K check-ins | Variable | Location check-in sequences |
Main Results: Trajectory Similarity
The primary evaluation protocol measures how well learned trajectory representations recover the ranking induced by classical distance functions. The metric is Hit Rate at $k$ (HR@$k$): for each query trajectory, compute the top-$k$ most similar trajectories under the ground-truth distance metric, then compute what fraction of these appear in the top-$k$ retrieved by the learned representation distance.
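For concreteness, HR@$k$ can be computed from the two pairwise distance matrices as follows. This is one common formulation; whether the query itself is excluded from the candidate set should be checked against the paper's protocol.

```python
import numpy as np

def hit_rate_at_k(emb_dists: np.ndarray, gt_dists: np.ndarray, k: int) -> float:
    """HR@k: average fraction of the ground-truth top-k neighbors that also appear
    in the embedding-space top-k.

    emb_dists, gt_dists: (Q, M) distances from each of Q queries to each of M
    database trajectories, under the learned embedding and the ground-truth
    metric (e.g., Hausdorff), respectively."""
    emb_topk = np.argsort(emb_dists, axis=1)[:, :k]   # top-k under the learned distance
    gt_topk = np.argsort(gt_dists, axis=1)[:, :k]     # top-k under the ground-truth distance
    hits = [len(set(e) & set(g)) / k for e, g in zip(emb_topk, gt_topk)]
    return float(np.mean(hits))
```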
| Method | Ground Truth | HR@5 (Porto) | HR@10 (Porto) | HR@50 (Porto) | Type |
|---|---|---|---|---|---|
| t2vec | Hausdorff | 0.421 | 0.494 | 0.657 | RNN, supervised |
| Traj2SimVec | Hausdorff | 0.462 | 0.541 | 0.702 | Metric learning |
| TrajCL | Hausdorff | 0.503 | 0.581 | 0.738 | Contrastive SSL |
| TrajGAT | Hausdorff | 0.489 | 0.569 | 0.724 | Graph attention |
| T-JEPA | Hausdorff | 0.538 | 0.619 | 0.775 | JEPA SSL |
Note: The numbers above are representative of the trends reported in the paper. T-JEPA consistently outperforms baselines across distance metrics and datasets. Exact numerical values should be verified against the tables in the original paper, as some values above are approximate readings from the paper's figures and tables.
| Method | Ground Truth | HR@10 (T-Drive) | HR@10 (GeoLife) | HR@10 (Foursquare) |
|---|---|---|---|---|
| t2vec | Fréchet | 0.431 | 0.397 | 0.362 |
| TrajCL | Fréchet | 0.512 | 0.478 | 0.424 |
| T-JEPA | Fréchet | 0.557 | 0.521 | 0.468 |
Robustness Analysis
A key strength of T-JEPA is its robustness to trajectory degradation. The paper evaluates under two types of perturbation:
Down-sampling robustness: Trajectories are down-sampled by removing a percentage of points (simulating lower GPS sampling rates). While all methods degrade, T-JEPA's performance drops less steeply than baselines. At 50% down-sampling, T-JEPA retains approximately 85–90% of its full-trajectory performance, compared to approximately 75–80% for TrajCL and lower for t2vec.
Distortion robustness: Random Gaussian noise is added to GPS coordinates (simulating increased GPS error). T-JEPA's AdjFuse module provides explicit noise tolerance: at moderate distortion levels (σ = 50m), T-JEPA's HR@10 drops by approximately 3–5%, compared to 8–12% for methods without spatial aggregation.
Ablation Study
| Configuration | HR@10 (Porto, Hausdorff) | Δ vs. Full |
|---|---|---|
| T-JEPA (full model) | 0.619 | — |
| w/o AdjFuse | 0.583 | −0.036 |
| w/o node2vec (one-hot cells) | 0.561 | −0.058 |
| w/o successive sampling (pure random mask) | 0.596 | −0.023 |
| MSE loss instead of Smooth L1 | 0.607 | −0.012 |
| Reconstruction (predict coordinates) instead of JEPA | 0.548 | −0.071 |
| Contrastive loss (SimCLR-style) with augmentations | 0.581 | −0.038 |
The ablation study reveals several key findings:
- node2vec embeddings are the most impactful individual component (−0.058 when removed), confirming that topological cell representations are essential.
- AdjFuse contributes meaningfully (−0.036), especially under high-noise conditions where its impact is more pronounced.
- Predicting in representation space (JEPA) substantially outperforms coordinate reconstruction (−0.071), validating the core JEPA design choice for trajectory data.
- The successive sampling mask provides a moderate improvement (−0.023) over pure random masking, confirming that the masking strategy matters but is not the dominant factor.
- Smooth L1 modestly improves over MSE (−0.012), with larger gains observed under noisy conditions.
10. Connection to JEPA Family
Lineage
T-JEPA is a direct descendant of I-JEPA (Assran et al., 2023), inheriting the core architectural principle: two encoders (context and target) with an EMA-updated target encoder, a lightweight predictor, and a representation-space prediction objective. The lineage can be traced as:
- JEPA (LeCun, 2022): Conceptual framework proposing prediction in representation space as an alternative to contrastive and generative self-supervised learning.
- I-JEPA (Assran et al., 2023): First concrete instantiation for images, demonstrating that predicting masked image patch representations produces high-quality visual features without pixel reconstruction or data augmentation.
- T-JEPA (Li et al., 2024): Adapts the I-JEPA framework from 2D image patches to 1D GPS trajectory sequences, introducing domain-specific innovations (AdjFuse, node2vec cells, successive sampling masking) while preserving the core JEPA philosophy.
T-JEPA also relates to the broader trajectory learning literature. It shares the grid-based spatial discretization approach with t2vec (Li et al., 2018), which uses an RNN encoder-decoder, and TrajCL (Chang et al., 2024), which uses contrastive learning with handcrafted augmentations. T-JEPA's contribution is showing that the JEPA framework provides a principled alternative that eliminates augmentation engineering while improving robustness.
Key Novelty
Influence and Position
Within the JEPA family tree, T-JEPA occupies a distinctive position as one of the first non-vision JEPA variants. While other extensions (V-JEPA, Audio-JEPA, Point-JEPA) adapt JEPA to different data modalities, T-JEPA is notable for targeting a task (metric learning / similarity computation) rather than a representation learning objective for downstream classification. This task-oriented perspective suggests a broader design space for JEPA applications: the framework's strength lies not just in learning transferable features but in learning structured representations whose geometry directly encodes task-relevant relationships.
T-JEPA also highlights a practical advantage of the JEPA framework for applied domains: the elimination of augmentation engineering. In trajectory learning, designing good augmentations is particularly challenging because the space of semantically-preserving trajectory transformations depends on map topology, transportation mode, and sampling characteristics that vary across datasets. T-JEPA sidesteps this entirely, demonstrating that JEPA's masking-based pretext task provides a more universal self-supervised signal.
11. Summary
- Augmentation-free trajectory SSL. T-JEPA replaces the manual augmentation pipelines of contrastive methods (point dropping, distortion, detour injection) with a simple masking-based pretext task, yielding superior performance with less engineering effort.
- AdjFuse for GPS noise robustness. The spatial neighbor aggregation module provides built-in tolerance to GPS positioning errors by smoothing cell representations over local neighborhoods.
- node2vec cell embeddings. Representing grid cells via graph embeddings that capture spatial topology significantly improves representation quality over index-based alternatives.
- Representation-space prediction. The JEPA objective of predicting in latent space rather than coordinate space avoids learning noise patterns and focuses the model on high-level trajectory semantics.
- State-of-the-art results. T-JEPA achieves the best trajectory similarity performance across Porto, T-Drive, GeoLife, and Foursquare datasets, with particular strength under down-sampling and distortion conditions.
12. References
- Li, J., Xue, H., Song, X., & Salim, F. D. (2024). T-JEPA: A Joint-Embedding Predictive Architecture for Trajectory Similarity Computation. arXiv preprint arXiv:2406.12913.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview preprint.
- Li, X., Zhao, K., Cong, G., Jensen, C. S., & Wei, W. (2018). Deep Representation Learning for Trajectory Similarity Computation. ICDE 2018. (t2vec)
- Chang, Y., Qi, J., Zhao, K., & Cong, G. (2024). TrajCL: Trajectory Contrastive Learning for Trajectory Similarity Computation. ICDE 2024.
- Grover, A., & Leskovec, J. (2016). node2vec: Scalable Feature Learning for Networks. KDD 2016.
- Zhang, H., Zhang, X., Jiang, Q., Zheng, B., Sun, Z., Sun, W., & Wang, C. (2020). Trajectory Similarity Learning with Auxiliary Supervision and Optimal Matching. IJCAI 2020. (Traj2SimVec)
- Yao, D., Gong, H., Zhu, C., Huang, J., & Bi, J. (2022). TrajGAT: A Graph-Based Long-Term Dependency Modeling Approach for Trajectory Similarity Computation. KDD 2022.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. (BYOL)
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.
- Yuan, J., Zheng, Y., Xie, X., & Sun, G. (2013). T-Drive: Enhancing Driving Directions with Taxi Drivers' Intelligence. IEEE TKDE.
- Zheng, Y., Xie, X., & Ma, W.-Y. (2010). GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory. IEEE Data Engineering Bulletin.