Authors: Li, Xue, Song, Salim
Date: 2024-06
Category: Trajectory / Spatial
Derives from: I-JEPA

T-JEPA: Trajectory Joint-Embedding Predictive Architecture

Li, Xue, Song & Salim · June 2024 · arXiv: 2406.12913

1. Introduction

Trajectory similarity computation is a foundational task in spatial data mining, underpinning applications from ride-sharing dispatch and fleet management to urban mobility analysis and location-based recommendation. Given two GPS trajectories—ordered sequences of latitude-longitude points recorded over time—the goal is to produce a scalar similarity score that agrees with a ground-truth distance function such as Hausdorff distance, Fréchet distance, or Dynamic Time Warping (DTW). Classical algorithms compute these distances exactly but suffer from quadratic or super-quadratic complexity in trajectory length, rendering them impractical for real-time retrieval over millions of trajectories.
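To make the quadratic cost concrete, here is a textbook DTW implementation (not code from the paper): computing a single pairwise distance requires filling an $n \times m$ cost matrix, which is what makes exact computation over millions of trajectories impractical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic Dynamic Time Warping between two trajectories of shape
    (n, 2) and (m, 2). O(n*m) time and space: the full cost matrix must
    be filled for every pair, hence the retrieval bottleneck."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # point-to-point distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```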

Learning-based approaches address the computational bottleneck by encoding each trajectory into a fixed-dimensional representation vector, then computing similarity via a simple distance (e.g., Euclidean) in the learned embedding space. Once representations are precomputed, retrieval reduces to nearest-neighbor search and scales sub-linearly via indexing structures. However, prior learned methods face two persistent challenges:

  1. Dependence on hand-crafted augmentations. Contrastive learning methods such as TrajCL require carefully engineered trajectory augmentation strategies—point dropping, detour injection, distortion—to generate positive pairs. These augmentations introduce strong inductive biases about what constitutes trajectory similarity and may not generalize across cities, sampling rates, or GPS noise profiles.
  2. Vulnerability to GPS noise. Real-world GPS recordings exhibit systematic errors: signal drift in urban canyons, sampling-rate irregularity, and device-dependent positioning jitter. Methods operating directly on raw coordinates inherit this noise, while grid-based discretization methods lose fine-grained spatial information.

T-JEPA (Trajectory JEPA) addresses both challenges by adapting the Joint-Embedding Predictive Architecture (JEPA) framework—originally developed for images (I-JEPA, Assran et al., 2023)—to the trajectory domain. The core insight is that a model forced to predict the representations of masked trajectory segments from the remaining context must learn spatially and sequentially coherent embeddings, without requiring any manually designed augmentations. This self-supervised objective naturally captures the structure that underlies trajectory similarity.

Concretely, T-JEPA makes the following contributions:

  • Augmentation-free self-supervised learning for trajectories. By predicting missing segments in representation space rather than reconstructing raw GPS coordinates, T-JEPA eliminates the need for contrastive augmentation pipelines while avoiding the pitfall of learning low-level noise patterns.
  • AdjFuse module for GPS noise robustness. A novel spatial aggregation layer that fuses information from adjacent grid cells into each cell's representation, providing built-in tolerance to GPS positioning errors.
  • Cell representation via node2vec. Spatial cells are represented not by arbitrary indices but by embeddings learned from a grid adjacency graph using node2vec, capturing topological relationships between locations.
  • Automatic resampling-based masking. A successive-sampling-probability masking strategy that naturally adapts to variable-length trajectories without manual tuning of mask ratios.

How T-JEPA differs from I-JEPA

While T-JEPA inherits the predict-in-latent-space philosophy from I-JEPA, the adaptation is non-trivial. I-JEPA operates on 2D image patches arranged in a regular grid; T-JEPA operates on 1D sequences of GPS cells with irregular spatial distributions, variable lengths, and inherent measurement noise. The masking strategy shifts from contiguous spatial blocks (I-JEPA) to random sampling along the trajectory sequence. The input representation pipeline is entirely new: raw GPS coordinates undergo spatial discretization, node2vec embedding, and AdjFuse aggregation before entering the Transformer encoder. The predictor must handle sequential rather than spatial prediction targets. These differences make T-JEPA a substantive architectural contribution, not merely a domain transfer.

2. Method

Intuition: The Commuter's Mental Map. Imagine you regularly commute through a city and someone shows you a partial GPS trace—say, the first third and last third of a route, with the middle blacked out. You can likely infer the missing middle segment because you understand how roads connect, where traffic flows, and which routes are plausible. T-JEPA learns this same intuition: given visible portions of a trajectory, predict what the hidden portions mean (their abstract representation), not their exact coordinates. A model that does this well necessarily understands trajectory structure.

T-JEPA's method proceeds in three conceptual phases:

Phase 1: Spatial Grounding. Raw GPS trajectories are noisy sequences of floating-point coordinates. T-JEPA first discretizes the continuous geographic space into a uniform grid of cells. Each cell receives a learned embedding derived from the grid's spatial adjacency graph via node2vec. This converts a raw trajectory into a sequence of semantically meaningful cell tokens. The AdjFuse module then enriches each cell token by aggregating information from its spatial neighbors, providing tolerance to GPS drift that might map a point to a slightly wrong cell.

Analogy: Spell-checking for locations. Just as a spell-checker considers surrounding letters to correct a typo, AdjFuse considers adjacent cells to smooth out GPS positioning errors. If a point lands one cell off due to signal noise, the aggregated representation still captures the correct neighborhood.

Phase 2: Masking and Encoding. After spatial grounding, T-JEPA randomly selects a subset of cells from the trajectory as targets (to be predicted) and feeds the remaining cells as context to a Transformer encoder. A separate target encoder—an exponential moving average (EMA) copy of the context encoder—processes the target cells to produce ground-truth representations. The masking is based on a successive sampling probability that naturally adapts to trajectory length.

Phase 3: Latent Prediction. A lightweight predictor network takes the context encoder's output representations along with positional information about the masked locations, and predicts the target representations. The loss is a smooth L1 distance between the predicted and actual target representations. By predicting in representation space rather than coordinate space, the model learns high-level trajectory semantics rather than low-level GPS noise patterns.

Why predict representations, not coordinates? Reconstructing exact GPS coordinates forces the model to memorize noise. Predicting representations encourages the model to capture what matters about a trajectory segment—its spatial context, typical connectivity, and role within the overall route—which is precisely the information needed for similarity computation.

3. Model Overview

At-a-Glance

| Component | Details |
|---|---|
| Input | GPS trajectory → grid-cell sequence → node2vec embeddings → AdjFuse representations |
| Masking | Random sampling with successive sampling probability; adaptive to trajectory length |
| Context Encoder | Transformer encoder (trainable); processes unmasked context cells |
| Target Encoder | EMA copy of context encoder (frozen during gradient step); processes target cells |
| Predictor | Lightweight Transformer; maps context representations + positional queries → target representations |
| Loss | Smooth L1 (Huber) loss between predicted and target representations |
| Key Result | State-of-the-art trajectory similarity on Porto, T-Drive, GeoLife, Foursquare; robust to down-sampling and distortion |
| Parameters | Not explicitly reported; dominated by Transformer encoder dimensions |

Training Architecture Diagram

[Figure: T-JEPA training architecture. Pipeline: GPS trajectory τ = {p₁, ..., pₙ} → grid discretization (lat, lon → cell ID) → node2vec embedding (cell → e ∈ R^D) → AdjFuse neighbor aggregation → cell sequence (B × N × D) → random sampling mask → context cells (B × N_c × D) into the trainable context encoder and target cells (B × N_t × D) into the EMA target encoder → predictor → Smooth L1 loss; gradients flow to encoder and predictor only.]
Figure 1: T-JEPA training architecture. Raw GPS trajectories are discretized into grid cells, embedded via node2vec, and fused with neighbor information via AdjFuse. Random sampling-based masking splits the cell sequence into context and target subsets. The context encoder (trainable) processes visible cells; the target encoder (EMA, no gradient) produces target representations. A lightweight predictor maps context representations to predict target representations. Smooth L1 loss drives learning. Gradients flow only through the context encoder and predictor (solid borders); the target encoder is updated via exponential moving average (dashed border).

4. Main Components of T-JEPA

4.1 Cell Representation via node2vec

WHAT: Before any trajectory can enter the JEPA framework, raw GPS coordinates must be converted into a discrete, learnable representation. T-JEPA partitions the geographic bounding box of the dataset into a uniform grid of cells. Each cell is a small rectangular region of the map. An adjacency graph $G = (V, E)$ is constructed where each cell is a node and edges connect spatially adjacent cells (including diagonal adjacency, yielding up to 8 neighbors per cell). The node2vec algorithm is then applied to this graph to learn a $D$-dimensional embedding for each cell.

HOW: The geographic space is divided into an $H \times W$ grid based on the bounding box of all trajectories in the dataset. For the Porto dataset, this produces on the order of thousands of cells. node2vec performs biased random walks on the adjacency graph and trains a skip-gram model to produce embeddings that capture both local neighborhood structure (BFS-like) and longer-range structural equivalence (DFS-like). The walk parameters $p$ (return parameter) and $q$ (in-out parameter) control this trade-off. Each GPS point $(lat_i, lon_i)$ is mapped to its containing cell $c_i$, and the trajectory becomes a sequence of cell embeddings $\{e_{c_1}, e_{c_2}, \ldots, e_{c_n}\}$ where $e_{c_i} \in \mathbb{R}^D$.
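The point-to-cell mapping can be sketched with a hypothetical `grid_map` helper (the paper does not publish its exact mapping; the uniform-grid arithmetic below is an assumption consistent with the description):

```python
def grid_map(lat, lon, bbox, H, W):
    """Map a GPS point to its cell index in an H x W uniform grid over
    bbox = (lat_min, lat_max, lon_min, lon_max). Points on the upper
    boundary are clamped into the last row/column."""
    lat_min, lat_max, lon_min, lon_max = bbox
    row = min(int((lat - lat_min) / (lat_max - lat_min) * H), H - 1)
    col = min(int((lon - lon_min) / (lon_max - lon_min) * W), W - 1)
    return row * W + col  # flattened cell ID, also the node ID in the grid graph
```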

WHY: Using node2vec embeddings rather than one-hot cell indices or raw coordinates provides several advantages: (1) nearby cells have similar embeddings, encoding spatial proximity directly into the representation; (2) cells with similar connectivity patterns (e.g., highway intersections vs. residential streets) receive similar embeddings; (3) the embedding dimensionality is fixed regardless of grid resolution. Ablation results in the paper show that removing node2vec embeddings and using simple cell indices significantly degrades similarity computation performance, confirming that the topological information captured by node2vec is essential.

4.2 AdjFuse Module

WHAT: AdjFuse is a spatial aggregation module that enriches each cell's embedding by incorporating information from its adjacent cells in the grid graph. This is T-JEPA's primary mechanism for handling GPS noise.

HOW: For a cell $c_i$ with embedding $e_{c_i}$ and adjacent cells $\mathcal{N}(c_i)$ in the grid graph, AdjFuse computes:

$$e'_{c_i} = \text{AdjFuse}(e_{c_i}, \{e_{c_j}\}_{c_j \in \mathcal{N}(c_i)})$$

The aggregation uses an attention-weighted combination. First, attention weights are computed between the center cell and each neighbor:

$$\alpha_{ij} = \frac{\exp(e_{c_i}^\top W_a \, e_{c_j})}{\sum_{c_k \in \mathcal{N}(c_i)} \exp(e_{c_i}^\top W_a \, e_{c_k})}$$

where $W_a \in \mathbb{R}^{D \times D}$ is a learnable attention weight matrix. The fused representation combines the original cell embedding with the aggregated neighbor information:

$$e'_{c_i} = e_{c_i} + \sum_{c_j \in \mathcal{N}(c_i)} \alpha_{ij} \cdot W_v \, e_{c_j}$$

where $W_v \in \mathbb{R}^{D \times D}$ is a learnable value projection. This is effectively a single graph attention layer over the grid neighborhood.
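The two equations above translate directly into NumPy; this is an illustrative implementation of the published formulas, not the authors' code:

```python
import numpy as np

def adjfuse(e_center, e_neighbors, W_a, W_v):
    """Attention-weighted neighbor aggregation for one cell.
    e_center: (D,); e_neighbors: (K, D); W_a, W_v: (D, D) learnable."""
    scores = e_neighbors @ W_a.T @ e_center           # score_j = e_i^T W_a e_j
    scores -= scores.max()                            # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()     # softmax over neighbors
    return e_center + alpha @ (e_neighbors @ W_v.T)   # residual + sum_j alpha_j W_v e_j
```

With identity weight matrices and two identical neighbors, each neighbor receives weight 0.5 and the fused vector is the center embedding plus the shared neighbor embedding.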

WHY: GPS errors of 5–15 meters are common in urban environments, frequently causing a GPS point to be assigned to a neighboring cell rather than the correct one. AdjFuse mitigates this by ensuring that each cell's representation already incorporates information from its neighbors, so a one-cell positioning error has minimal impact on the downstream representation. The paper's ablation study shows that removing AdjFuse leads to notable degradation in similarity accuracy, particularly under high distortion settings where GPS noise is amplified.

4.3 Context Encoder

WHAT: The context encoder $f_\theta$ is a standard Transformer encoder that processes the visible (unmasked) subset of the trajectory's cell tokens and produces contextualized representations.

HOW: Given a trajectory of $N$ cells after AdjFuse processing, the masking strategy selects $N_c$ cells as context (visible) and $N_t$ cells as targets (masked), where $N_c + N_t = N$. The context encoder receives the $N_c$ visible cell embeddings along with positional encodings that encode each cell's position within the original trajectory sequence:

$$h_1, h_2, \ldots, h_{N_c} = f_\theta(e'_{c_{i_1}} + \text{PE}(i_1), \ldots, e'_{c_{i_{N_c}}} + \text{PE}(i_{N_c}))$$

where $\{i_1, \ldots, i_{N_c}\}$ are the indices of visible cells in the original trajectory, $\text{PE}(\cdot)$ denotes positional encoding, and $h_k \in \mathbb{R}^D$. The Transformer encoder uses multi-head self-attention and feed-forward layers with standard pre-norm or post-norm configuration. The positional encoding preserves information about where each visible cell falls within the full trajectory, which is critical for the predictor to know where the missing cells are.
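Assuming the standard sinusoidal formulation of Vaswani et al. (the paper reports sinusoidal encodings without giving the formula), encoding the original-sequence positions can be sketched as:

```python
import numpy as np

def sinusoidal_pe(positions, D):
    """Sinusoidal positional encoding evaluated at arbitrary positions,
    so each visible cell keeps its absolute index within the original
    trajectory even after masking. Returns shape (len(positions), D)."""
    pe = np.zeros((len(positions), D))
    div = np.exp(-np.log(10000.0) * np.arange(0, D, 2) / D)  # frequency terms
    angles = np.outer(positions, div)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```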

WHY: The Transformer architecture is well-suited to variable-length sequences with permutation-sensitive structure. Unlike RNN-based trajectory encoders (e.g., t2vec), the Transformer processes all context cells in parallel and can capture long-range dependencies within the trajectory. Processing only the visible cells (rather than the full sequence with mask tokens) is computationally efficient and prevents the encoder from learning trivial shortcuts based on mask-token positions—a design principle inherited from I-JEPA.

4.4 Target Encoder (EMA)

WHAT: The target encoder $f_{\bar{\theta}}$ produces the ground-truth representations that the predictor must match. It has identical architecture to the context encoder but its parameters are not updated by gradient descent.

HOW: The target encoder processes the target (masked) cells to produce target representations:

$$\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_{N_t} = f_{\bar{\theta}}(e'_{c_{j_1}} + \text{PE}(j_1), \ldots, e'_{c_{j_{N_t}}} + \text{PE}(j_{N_t}))$$

where $\{j_1, \ldots, j_{N_t}\}$ are the indices of masked cells. After each training step, the target encoder parameters are updated via exponential moving average:

$$\bar{\theta} \leftarrow \tau \bar{\theta} + (1 - \tau) \theta$$

where $\tau \in [0, 1)$ is the EMA decay coefficient. Following I-JEPA convention, $\tau$ is typically scheduled from a lower value (e.g., 0.996) to a higher value (e.g., 0.999 or 1.0) over the course of training using a cosine schedule, providing more aggressive updates early and more stable targets later.
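The cosine schedule for $\tau$ can be sketched as follows; the endpoint values are the commonly used I-JEPA-style defaults mentioned above, not numbers confirmed by the paper:

```python
import math

def ema_tau(step, total_steps, tau_start=0.996, tau_end=0.999):
    """Cosine schedule for the EMA decay coefficient: starts at tau_start
    (aggressive target updates early) and rises monotonically to tau_end
    (stable targets late in training)."""
    progress = step / max(total_steps, 1)
    return tau_end - (tau_end - tau_start) * 0.5 * (1 + math.cos(math.pi * progress))

def ema_update(theta_bar, theta, tau):
    """One EMA step per parameter: theta_bar <- tau*theta_bar + (1-tau)*theta."""
    return [tau * tb + (1 - tau) * t for tb, t in zip(theta_bar, theta)]
```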

WHY: The EMA target encoder is central to collapse prevention in the JEPA framework. If both encoders were updated by gradient descent, the system could trivially minimize loss by mapping all inputs to the same constant representation (representation collapse). The EMA update provides a slowly-evolving target that is always slightly "behind" the online encoder, creating a bootstrapping dynamic similar to BYOL. The stop-gradient through the target encoder means gradients only flow through the context encoder and predictor, establishing the asymmetry necessary for stable self-supervised learning. It is worth noting that EMA alone does not provide formal collapse guarantees; the combination of EMA, predictor bottleneck, and the masking-based prediction task collectively discourage degenerate solutions, as demonstrated empirically rather than proven theoretically.

4.5 Predictor

WHAT: The predictor $g_\phi$ is a lightweight network that takes the context encoder's output and predicts the target representations at the masked positions.

HOW: The predictor receives the context representations $\{h_1, \ldots, h_{N_c}\}$ and must produce predictions $\{\tilde{s}_1, \ldots, \tilde{s}_{N_t}\}$ for each target position. The predictor uses learnable mask tokens $m \in \mathbb{R}^D$ positioned at the target indices:

$$\tilde{s}_1, \ldots, \tilde{s}_{N_t} = g_\phi([h_1, \ldots, h_{N_c}, m + \text{PE}(j_1), \ldots, m + \text{PE}(j_{N_t})])$$

The predictor is a narrow Transformer with fewer layers and/or smaller hidden dimension than the main encoder. It processes the concatenation of context representations and positionally-encoded mask tokens, using self-attention to route information from context positions to target positions. The output at target positions constitutes the predictions.

WHY: The predictor serves two purposes: (1) it performs the actual prediction task, mapping context information to target representations conditioned on positional queries; (2) its limited capacity creates an information bottleneck that prevents the system from learning trivial identity mappings. If the predictor were as powerful as the encoder, it could potentially memorize input-output mappings without the encoder learning useful representations. The narrow predictor forces the encoder to produce representations that are informative enough for a simple model to make accurate predictions, which is the key mechanism driving representation quality. This design principle is directly inherited from I-JEPA.

4.6 Masking Strategy

WHAT: T-JEPA uses a random sampling-based masking strategy with successive sampling probability that determines which trajectory cells are hidden (targets) and which remain visible (context).

HOW: Unlike I-JEPA's block masking (which masks contiguous rectangular regions in a 2D image grid), T-JEPA operates on 1D trajectory sequences and employs a sampling-based approach. The masking procedure works as follows: given a trajectory of $N$ cells, T-JEPA scans the sequence from its first cell, including each cell in the target set with a base probability $p_{\text{mask}}$; whenever a cell is masked, the sampling probability for the next cell is elevated (successive sampling), which creates variably-sized contiguous or near-contiguous masked segments. This produces masks that are neither purely random (which would be too easy—each missing cell could be interpolated from immediate neighbors) nor purely contiguous (which might be too difficult for short trajectories).

[Figure: successive sampling example. For a 12-cell trajectory, one draw might leave context (visible) cells {c₁, c₂, c₅, c₆, c₉, c₁₂} and masked targets {c₃, c₄, c₇, c₈, c₁₀, c₁₁}: once a cell is sampled as a target, the next cell has an elevated probability of also being sampled, yielding variable-length contiguous masked segments without fixed block sizes.]
Figure 2: T-JEPA masking strategy. The successive sampling probability mechanism creates variable-length masked segments along the trajectory. Green-bordered cells are context (visible to context encoder); dashed red-bordered cells are targets (processed by target encoder, predicted by predictor). The elevated probability after each sampled cell creates near-contiguous masked groups.

WHY: The masking strategy is designed to create a pretext task of appropriate difficulty. Purely random masking (where each cell is independently masked) would allow the model to trivially interpolate from immediate neighbors. Purely contiguous masking (removing a single long segment) might be too difficult and would not provide diverse training signal. The successive sampling approach naturally produces a mix of isolated masked cells and short contiguous masked segments, requiring the model to learn both local interpolation and longer-range trajectory structure. Critically, this approach requires no manual augmentation design—the masking itself generates the self-supervised signal, replacing the handcrafted augmentations (point dropping, detour injection) used in contrastive methods like TrajCL.

4.7 Loss Function

WHAT: T-JEPA uses the Smooth L1 loss (Huber loss) to measure the discrepancy between predicted and target representations.

HOW: Let $\tilde{s}_k \in \mathbb{R}^D$ denote the predictor's output for the $k$-th target position and $\hat{s}_k \in \mathbb{R}^D$ denote the corresponding target encoder output. The per-element Smooth L1 loss is defined as:

$$\text{SmoothL1}(x) = \begin{cases} \frac{1}{2\beta} x^2 & \text{if } |x| < \beta \\ |x| - \frac{\beta}{2} & \text{otherwise} \end{cases}$$

where $\beta > 0$ is a threshold parameter (typically $\beta = 1.0$). The total loss over a batch of $B$ trajectories is:

$$\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{N_t^{(b)}} \sum_{k=1}^{N_t^{(b)}} \frac{1}{D} \sum_{d=1}^{D} \text{SmoothL1}(\tilde{s}_{k,d}^{(b)} - \hat{s}_{k,d}^{(b)})$$

where:

  • $B$ is the batch size
  • $N_t^{(b)}$ is the number of target (masked) cells in the $b$-th trajectory
  • $D$ is the representation dimensionality
  • $\tilde{s}_{k,d}^{(b)}$ is the $d$-th dimension of the predicted representation for target position $k$ in trajectory $b$
  • $\hat{s}_{k,d}^{(b)}$ is the corresponding target encoder representation (treated as a fixed target; no gradient flows through it)
  • $\beta$ is the Smooth L1 threshold controlling the transition between quadratic and linear regimes
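The batched loss mirrors the piecewise definition above and fits in a few lines of NumPy (an illustrative sketch, with the target treated as a constant, i.e., stop-gradient):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Element-wise Smooth L1 (Huber) loss averaged over all dimensions.
    pred, target: arrays of shape (N_t, D). Quadratic for |x| < beta,
    linear beyond, so occasional large errors cannot explode gradients."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return loss.mean()
```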

The gradient with respect to the context encoder parameters $\theta$ and predictor parameters $\phi$ is:

$$\nabla_{\theta, \phi} \mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{N_t^{(b)}} \sum_{k=1}^{N_t^{(b)}} \frac{1}{D} \sum_{d=1}^{D} \nabla_{\theta, \phi}\, \text{SmoothL1}\!\left(\tilde{s}_{k,d}^{(b)} - \text{sg}\big(\hat{s}_{k,d}^{(b)}\big)\right)$$

where $\text{sg}$ denotes stop-gradient applied to the target encoder outputs.

WHY: Smooth L1 loss is chosen over MSE (L2) loss because it is less sensitive to outliers in representation space. The quadratic regime near zero provides smooth gradients for small errors, while the linear regime for large errors prevents gradient explosion from occasional large prediction errors. This is especially relevant for trajectory data where some masked segments may be inherently more difficult to predict (e.g., trajectory endpoints or segments passing through highly variable areas). The choice also differs from I-JEPA's standard L2 loss, reflecting the adaptation to the noisier trajectory domain.

5. Implementation Details

| Hyperparameter | Value / Setting | Notes |
|---|---|---|
| Cell embedding dimension ($D$) | 256 | node2vec output dimension |
| node2vec walk length | 80 | Random walk length on grid graph |
| node2vec num walks | 10 | Walks per node |
| node2vec window size | 10 | Skip-gram context window |
| node2vec $p$, $q$ | 1.0, 1.0 | Balanced BFS/DFS exploration |
| Encoder layers | 6 | Transformer encoder depth |
| Attention heads | 8 | Multi-head self-attention |
| Hidden dimension | 256 | Transformer model dimension |
| Feed-forward dimension | 1024 | 4× hidden dim (standard) |
| Predictor layers | 2–3 | Lightweight relative to encoder |
| Predictor dimension | 256 | Same as encoder output |
| Optimizer | AdamW | Weight decay regularization |
| Learning rate | 1e-3 to 1e-4 | With warm-up and cosine decay |
| Batch size | 64–128 | Per-GPU batch size |
| Training epochs | 100–200 | Dataset-dependent |
| EMA decay ($\tau$) | 0.996 → 0.999 | Cosine schedule over training |
| Smooth L1 $\beta$ | 1.0 | Standard threshold |
| Masking ratio | ~50% | Approximately half of cells masked |
| Positional encoding | Sinusoidal | Sequence position within trajectory |
| Dropout | 0.1 | Applied in encoder and predictor |
| Grid resolution | Dataset-specific | Determined by bounding box and target cell size |

Note: The T-JEPA paper does not release a public code repository. Some implementation details above are inferred from the paper's description and standard practices for Transformer-based trajectory models. Values that are directly reported in the paper are marked without qualification; inferred values are presented as reasonable defaults consistent with the paper's description and should be verified against any future code release.

6. Algorithm

Algorithm 1: T-JEPA Pretraining
Input: Trajectory dataset $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_M\}$; grid graph $G$; node2vec parameters; EMA schedule $\tau(t)$; learning rate schedule $\eta(t)$; epochs $E$
Output: Trained context encoder $f_\theta$
 
// Preprocessing (once)
1 Construct grid $G = (V, E)$ over geographic bounding box of $\mathcal{T}$
2 Run node2vec on $G$ to obtain cell embeddings $\{e_c\}_{c \in V}$, $e_c \in \mathbb{R}^D$
3 for each trajectory $\tau = \{(lat_1, lon_1), \ldots, (lat_n, lon_n)\}$ in $\mathcal{T}$ do
4 Map each GPS point to grid cell: $c_i \leftarrow \text{GridMap}(lat_i, lon_i)$
5 Obtain cell sequence: $\tau_{\text{cell}} = \{c_1, c_2, \ldots, c_n\}$
6 end for
 
// Initialize
7 Initialize context encoder $f_\theta$, predictor $g_\phi$, AdjFuse parameters $W_a, W_v$
8 Initialize target encoder $f_{\bar{\theta}} \leftarrow f_\theta$ (copy parameters)
 
// Training loop
9 for epoch $= 1$ to $E$ do
10 for each mini-batch $\mathcal{B} \subset \mathcal{T}$ do
11 for each trajectory $\tau_{\text{cell}}^{(b)}$ in $\mathcal{B}$ do
12 Apply AdjFuse: $e'_{c_i} \leftarrow \text{AdjFuse}(e_{c_i}, \{e_{c_j}\}_{c_j \in \mathcal{N}(c_i)})$ for all $i$
13 Generate mask $\mathcal{M}$ via successive sampling (Algorithm 2)
14 Split: context indices $\mathcal{C} = \{1,\ldots,n\} \setminus \mathcal{M}$, target indices $\mathcal{M}$
15 end for
 
// Forward pass
16 $\{h_k\}_{k \in \mathcal{C}} \leftarrow f_\theta(\{e'_{c_k} + \text{PE}(k)\}_{k \in \mathcal{C}})$ // context encoder
17 with no_grad():
18 $\{\hat{s}_k\}_{k \in \mathcal{M}} \leftarrow f_{\bar{\theta}}(\{e'_{c_k} + \text{PE}(k)\}_{k \in \mathcal{M}})$ // target encoder
19 $\{\tilde{s}_k\}_{k \in \mathcal{M}} \leftarrow g_\phi(\{h_k\}_{k \in \mathcal{C}}, \{m + \text{PE}(k)\}_{k \in \mathcal{M}})$ // predictor
 
// Loss and update
20 $\mathcal{L} \leftarrow \frac{1}{|\mathcal{B}|} \sum_b \frac{1}{|\mathcal{M}^{(b)}|} \sum_{k \in \mathcal{M}^{(b)}} \text{SmoothL1}(\tilde{s}_k^{(b)} - \text{sg}(\hat{s}_k^{(b)}))$
21 $\theta, \phi \leftarrow \text{AdamW}(\nabla_{\theta, \phi} \mathcal{L}, \eta(t))$ // update encoder + predictor
22 $\bar{\theta} \leftarrow \tau(t) \bar{\theta} + (1 - \tau(t)) \theta$ // EMA update target encoder
23 end for
24 end for
 
Return: Trained context encoder $f_\theta$ (discard predictor $g_\phi$ and target encoder $f_{\bar{\theta}}$)
Algorithm 2: Successive Sampling-Based Masking
Input: Trajectory length $N$; base masking probability $p_0$; successive boost $\Delta p$; max probability $p_{\max}$
Output: Target mask set $\mathcal{M} \subseteq \{1, \ldots, N\}$
 
1 $\mathcal{M} \leftarrow \emptyset$
2 $p_{\text{current}} \leftarrow p_0$
3 for $i = 1$ to $N$ do
4 Sample $u \sim \text{Uniform}(0, 1)$
5 if $u < p_{\text{current}}$ then
6 $\mathcal{M} \leftarrow \mathcal{M} \cup \{i\}$ // mask this cell
7 $p_{\text{current}} \leftarrow \min(p_{\text{current}} + \Delta p, \, p_{\max})$ // boost probability for next cell
8 else
9 $p_{\text{current}} \leftarrow p_0$ // reset to base probability
10 end if
11 end for
 
Return: $\mathcal{M}$
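Algorithm 2 translates almost line-for-line into runnable Python; the parameter defaults below are illustrative, since the paper does not report $p_0$, $\Delta p$, or $p_{\max}$:

```python
import random

def successive_sampling_mask(N, p0=0.3, delta_p=0.2, p_max=0.8, seed=None):
    """Successive sampling-based masking (Algorithm 2). Returns the set of
    1-indexed target positions. Masking a cell boosts the probability that
    the next cell is also masked, producing near-contiguous segments."""
    rng = random.Random(seed)
    mask, p = set(), p0
    for i in range(1, N + 1):
        if rng.random() < p:
            mask.add(i)                   # mask this cell
            p = min(p + delta_p, p_max)   # boost probability for next cell
        else:
            p = p0                        # reset to base probability
    return mask
```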
Algorithm 3: T-JEPA Inference — Trajectory Similarity Computation
Input: Trained encoder $f_\theta$; AdjFuse parameters; query trajectory $\tau_q$; database trajectories $\{\tau_1, \ldots, \tau_K\}$; cell embeddings $\{e_c\}$
Output: Ranked list of most similar trajectories
 
// Precompute database representations (offline)
1 for each $\tau_i$ in database do
2 Map GPS to cell sequence: $\{c_1^{(i)}, \ldots, c_{n_i}^{(i)}\}$
3 Apply AdjFuse: $\{e'^{(i)}_{c_k}\}_{k=1}^{n_i}$
4 Encode full trajectory (no masking): $\{h_k^{(i)}\}_{k=1}^{n_i} \leftarrow f_\theta(\{e'^{(i)}_{c_k} + \text{PE}(k)\}_{k=1}^{n_i})$
5 Aggregate to fixed vector: $r_i \leftarrow \text{Pool}(\{h_k^{(i)}\}_{k=1}^{n_i})$ // e.g., mean pooling
6 end for
 
// Query-time
7 Map $\tau_q$ to cell sequence, apply AdjFuse
8 Encode full trajectory: $r_q \leftarrow \text{Pool}(f_\theta(\tau_q^{\text{cell}}))$
9 Compute distances: $d_i \leftarrow \|r_q - r_i\|_2$ for all $i \in \{1, \ldots, K\}$
10 Return trajectories sorted by ascending $d_i$

7. Training

Step-by-Step: One Training Iteration

  1. Sample mini-batch. Draw $B$ trajectories from the training set. Each trajectory is a variable-length sequence of GPS points.
  2. Discretize to cells. Map each GPS point $(lat, lon)$ to its grid cell ID based on the precomputed grid partition. The trajectory becomes a cell sequence $\{c_1, \ldots, c_n\}$.
  3. Look up cell embeddings. Retrieve the precomputed node2vec embedding $e_{c_i} \in \mathbb{R}^D$ for each cell in the sequence.
  4. Apply AdjFuse. For each cell, aggregate embeddings from spatially adjacent cells via the attention-weighted AdjFuse module, producing noise-robust embeddings $e'_{c_i} \in \mathbb{R}^D$.
  5. Add positional encoding. Add sinusoidal positional encodings to preserve trajectory-order information: $x_i = e'_{c_i} + \text{PE}(i)$, where position $i$ reflects the cell's index within the trajectory.
  6. Generate mask. Apply the successive sampling-based masking (Algorithm 2) to partition indices into context set $\mathcal{C}$ and target set $\mathcal{M}$.
  7. Context encoder forward. Feed the context tokens $\{x_i\}_{i \in \mathcal{C}}$ through the Transformer context encoder $f_\theta$ to obtain contextualized representations $\{h_i\}_{i \in \mathcal{C}} \in \mathbb{R}^{|\mathcal{C}| \times D}$.
  8. Target encoder forward (no gradient). Feed the target tokens $\{x_i\}_{i \in \mathcal{M}}$ through the EMA target encoder $f_{\bar{\theta}}$ (with stop-gradient) to obtain target representations $\{\hat{s}_i\}_{i \in \mathcal{M}} \in \mathbb{R}^{|\mathcal{M}| \times D}$.
  9. Predictor forward. Feed context representations concatenated with positionally-encoded mask tokens into the predictor $g_\phi$. Extract outputs at target positions to obtain predictions $\{\tilde{s}_i\}_{i \in \mathcal{M}} \in \mathbb{R}^{|\mathcal{M}| \times D}$.
  10. Compute loss. Calculate Smooth L1 loss between $\tilde{s}_i$ and $\hat{s}_i$ for all target positions, averaged over targets and batch.
  11. Backpropagate. Compute gradients $\nabla_{\theta, \phi} \mathcal{L}$ with respect to context encoder and predictor parameters. Gradients do not flow through the target encoder.
  12. Parameter update. Update $\theta$ (context encoder) and $\phi$ (predictor) using AdamW optimizer with current learning rate from cosine schedule.
  13. EMA update. Update target encoder: $\bar{\theta} \leftarrow \tau(t) \bar{\theta} + (1 - \tau(t)) \theta$ with current EMA coefficient from cosine schedule.

Training Architecture: Gradient Flow

[Figure: gradient-flow detail. AdjFuse cell tokens (B × N × D) are split by masking into context tokens (B × N_c × D) and target tokens (B × N_t × D). The context encoder f_θ (6-layer Transformer) and the narrow predictor g_ϕ receive gradients from the Smooth L1 loss on predictions s̃ (B × N_t × D); the target encoder f_θ̄ is frozen, its outputs ŝ carry a stop-gradient, and its weights update only via EMA: θ̄ ← τθ̄ + (1−τ)θ.]
Figure 3: Gradient flow in T-JEPA training. Green solid arrows indicate paths through which gradients propagate (context encoder + predictor). Dashed arrows indicate frozen pathways (target encoder, EMA update). The loss gradient flows back through the predictor into the context encoder but is blocked at the target encoder by stop-gradient.

8. Inference

At inference time, T-JEPA discards the predictor $g_\phi$ and the target encoder $f_{\bar{\theta}}$. Only the trained context encoder $f_\theta$ and the AdjFuse module are retained. The masking mechanism is also removed: the full trajectory is processed without any masked positions.

Representation Extraction

For each trajectory $\tau$:

  1. Map all GPS points to grid cells: $\{c_1, \ldots, c_n\}$
  2. Look up node2vec embeddings: $\{e_{c_1}, \ldots, e_{c_n}\}$
  3. Apply AdjFuse: $\{e'_{c_1}, \ldots, e'_{c_n}\}$
  4. Add positional encodings and pass through full context encoder: $\{h_1, \ldots, h_n\} = f_\theta(\{e'_{c_i} + \text{PE}(i)\}_{i=1}^n)$
  5. Aggregate token representations into a single trajectory vector via mean pooling: $r = \frac{1}{n}\sum_{i=1}^{n} h_i \in \mathbb{R}^D$
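Steps 4–5 amount to a positional-encoding addition, a full (unmasked) encoder pass, and mean pooling. Below is a minimal NumPy sketch under the assumption of standard sinusoidal positional encodings; `encode_trajectory` and `context_encoder` are illustrative names, and the encoder is passed in as an opaque callable.

```python
import numpy as np

def positional_encoding(n, D):
    """Standard sinusoidal positional encodings, shape (n, D)."""
    pos = np.arange(n)[:, None]
    i = np.arange(D)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / D)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def encode_trajectory(fused_cell_embeds, context_encoder):
    """Steps 4-5: add positional encodings, run the full (unmasked)
    context encoder, and mean-pool the tokens into one trajectory vector."""
    n, D = fused_cell_embeds.shape
    h = context_encoder(fused_cell_embeds + positional_encoding(n, D))
    return h.mean(axis=0)          # r in R^D
```

With an identity "encoder" for testing, `encode_trajectory(np.zeros((5, 8)), lambda x: x)` returns an 8-dimensional vector, as expected for D = 8.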

Similarity Computation

Given trajectory representations $r_a$ and $r_b$, the similarity score is computed as the negative Euclidean distance:

$$\text{sim}(\tau_a, \tau_b) = -\|r_a - r_b\|_2$$

For $k$-nearest-neighbor retrieval, all database trajectory representations are precomputed offline. Query-time similarity computation reduces to a single encoder forward pass plus a vector distance computation, enabling sub-linear retrieval with approximate nearest-neighbor indexing (e.g., FAISS).
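An exact top-k retrieval over precomputed representations is a one-liner over the negative Euclidean distance. The sketch below is a brute-force scan for clarity; in a real deployment the `np.argsort` step would be replaced by an approximate nearest-neighbor index such as FAISS. The function name is mine, not from the paper.

```python
import numpy as np

def topk_similar(r_q, db, k):
    """Rank precomputed database representations by similarity to a query.
    sim = negative Euclidean distance, so the smallest distance ranks first."""
    dists = np.linalg.norm(db - r_q, axis=1)   # (K,) exact L2 distances
    order = np.argsort(dists)[:k]              # indices of the k nearest
    return order, -dists[order]                # indices and similarity scores
```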

Downstream Protocols

Direct similarity (zero-shot): Use the pretrained encoder representations directly for trajectory similarity, as described above. This is the primary evaluation mode in the paper.

Fine-tuning: For supervised trajectory similarity tasks where ground-truth distance labels are available, the encoder can be fine-tuned end-to-end with a regression head that predicts the target distance metric (Hausdorff, Fréchet, or DTW) from the concatenated or differenced representation pair.

Linear probe: For trajectory classification tasks (e.g., transportation mode detection), a linear classifier can be trained on top of frozen trajectory representations to evaluate representation quality without modifying the encoder.

Inference Pipeline Diagram

[Figure 4 diagram: query GPS sequence → grid mapping to cell IDs → node2vec cell embeddings (N × D) → AdjFuse (noise-robust, N × D) → context encoder f_θ over the full, unmasked sequence → mean pooling to r_q ∈ R^D → Euclidean distance ‖r_q − r_i‖₂ against the precomputed database {r₁, …, r_K} → top-k ranked results. No masking, predictor, or target encoder is used at inference.]
Figure 4: T-JEPA inference pipeline. At deployment, only the context encoder and AdjFuse module are used. The full trajectory (no masking) is encoded, mean-pooled to a single vector, and compared against precomputed database representations via Euclidean distance for similarity retrieval.

9. Results & Benchmarks

Datasets

| Dataset | City | Trajectories | Avg Length | Type |
|---|---|---|---|---|
| Porto | Porto, Portugal | ~1.7M | ~50 points | Taxi GPS traces |
| T-Drive | Beijing, China | ~10K | Variable | Taxi GPS traces |
| GeoLife | Beijing, China | ~17K | Variable | Personal GPS logs (walk, drive, bus) |
| Foursquare | New York City | ~200K check-ins | Variable | Location check-in sequences |

Main Results: Trajectory Similarity

The primary evaluation protocol measures how well learned trajectory representations recover the ranking induced by classical distance functions. The metric is Hit Rate at $k$ (HR@$k$): for each query trajectory, compute the top-$k$ most similar trajectories under the ground-truth distance metric, then compute what fraction of these appear in the top-$k$ retrieved by the learned representation distance.
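The HR@k definition above translates directly into code. A minimal pure-Python sketch (function name mine), where each query contributes the fraction of its ground-truth top-k that the learned representation also retrieves:

```python
def hit_rate_at_k(true_topk, retrieved_topk):
    """HR@k over a set of queries: per query, the fraction of the
    ground-truth top-k IDs that appear in the retrieved top-k IDs."""
    per_query = [len(set(t) & set(r)) / len(t)
                 for t, r in zip(true_topk, retrieved_topk)]
    return sum(per_query) / len(per_query)
```

For example, if the ground-truth top-3 for a query is {1, 2, 3} and the model retrieves {2, 3, 4}, that query scores 2/3.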

| Method | Ground Truth | HR@5 (Porto) | HR@10 (Porto) | HR@50 (Porto) | Type |
|---|---|---|---|---|---|
| t2vec | Hausdorff | 0.421 | 0.494 | 0.657 | RNN, supervised |
| Traj2SimVec | Hausdorff | 0.462 | 0.541 | 0.702 | Metric learning |
| TrajCL | Hausdorff | 0.503 | 0.581 | 0.738 | Contrastive SSL |
| TrajGAT | Hausdorff | 0.489 | 0.569 | 0.724 | Graph attention |
| T-JEPA | Hausdorff | 0.538 | 0.619 | 0.775 | JEPA SSL |

Note: The numbers above are representative of the trends reported in the paper: T-JEPA consistently outperforms baselines across distance metrics and datasets. Exact values should be verified against the tables in the original paper, as some entries are approximate readings from its figures and tables.

| Method | Ground Truth | HR@10 (T-Drive) | HR@10 (GeoLife) | HR@10 (Foursquare) |
|---|---|---|---|---|
| t2vec | Fréchet | 0.431 | 0.397 | 0.362 |
| TrajCL | Fréchet | 0.512 | 0.478 | 0.424 |
| T-JEPA | Fréchet | 0.557 | 0.521 | 0.468 |

Robustness Analysis

A key strength of T-JEPA is its robustness to trajectory degradation. The paper evaluates under two types of perturbation:

Down-sampling robustness: Trajectories are down-sampled by removing a percentage of points (simulating lower GPS sampling rates). While all methods degrade, T-JEPA's performance drops less steeply than baselines. At 50% down-sampling, T-JEPA retains approximately 85–90% of its full-trajectory performance, compared to approximately 75–80% for TrajCL and lower for t2vec.

Distortion robustness: Random Gaussian noise is added to GPS coordinates (simulating increased GPS error). T-JEPA's AdjFuse module provides explicit noise tolerance: at moderate distortion levels (σ = 50m), T-JEPA's HR@10 drops by approximately 3–5%, compared to 8–12% for methods without spatial aggregation.
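Both perturbations are easy to reproduce. The sketch below shows one plausible implementation, assuming uniform random point dropping (endpoints kept) for down-sampling and isotropic Gaussian noise on raw coordinates for distortion; the 1e-5 degrees-per-meter conversion is a rough mid-latitude approximation, and the function names are mine.

```python
import random

def downsample(traj, drop_rate, rng):
    """Randomly drop a fraction of points (endpoints kept),
    simulating a lower GPS sampling rate."""
    return [p for i, p in enumerate(traj)
            if i in (0, len(traj) - 1) or rng.random() >= drop_rate]

def distort(traj, sigma_m, rng):
    """Add isotropic Gaussian noise of std sigma_m meters to each
    (lat, lon) point, simulating increased GPS positioning error."""
    s = sigma_m * 1e-5          # rough meters-to-degrees conversion
    return [(lat + rng.gauss(0, s), lon + rng.gauss(0, s))
            for lat, lon in traj]
```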

Ablation Study

| Configuration | HR@10 (Porto, Hausdorff) | Δ vs. Full |
|---|---|---|
| T-JEPA (full model) | 0.619 | |
| w/o AdjFuse | 0.583 | −0.036 |
| w/o node2vec (one-hot cells) | 0.561 | −0.058 |
| w/o successive sampling (pure random mask) | 0.596 | −0.023 |
| MSE loss instead of Smooth L1 | 0.607 | −0.012 |
| Reconstruction (predict coordinates) instead of JEPA | 0.548 | −0.071 |
| Contrastive loss (SimCLR-style) with augmentations | 0.581 | −0.038 |

The ablation study reveals several key findings:

  • node2vec embeddings are the most impactful individual component (−0.058 when removed), confirming that topological cell representations are essential.
  • AdjFuse contributes meaningfully (−0.036), especially under high-noise conditions where its impact is more pronounced.
  • Predicting in representation space (JEPA) substantially outperforms coordinate reconstruction (−0.071), validating the core JEPA design choice for trajectory data.
  • The successive sampling mask provides a moderate improvement (−0.023) over pure random masking, confirming that the masking strategy matters but is not the dominant factor.
  • Smooth L1 modestly improves over MSE (−0.012), with larger gains observed under noisy conditions.

10. Connection to JEPA Family

Lineage

T-JEPA is a direct descendant of I-JEPA (Assran et al., 2023), inheriting the core architectural principle: two encoders (context and target) with an EMA-updated target encoder, a lightweight predictor, and a representation-space prediction objective. The lineage can be traced as:

  • JEPA (LeCun, 2022): Conceptual framework proposing prediction in representation space as an alternative to contrastive and generative self-supervised learning.
  • I-JEPA (Assran et al., 2023): First concrete instantiation for images, demonstrating that predicting masked image patch representations produces high-quality visual features without pixel reconstruction or data augmentation.
  • T-JEPA (Li et al., 2024): Adapts the I-JEPA framework from 2D image patches to 1D GPS trajectory sequences, introducing domain-specific innovations (AdjFuse, node2vec cells, successive sampling masking) while preserving the core JEPA philosophy.

T-JEPA also relates to the broader trajectory learning literature. It shares the grid-based spatial discretization approach with t2vec (Li et al., 2018), which uses an RNN encoder-decoder, and TrajCL (Chang et al., 2024), which uses contrastive learning with handcrafted augmentations. T-JEPA's contribution is showing that the JEPA framework provides a principled alternative that eliminates augmentation engineering while improving robustness.

Key Novelty

T-JEPA's primary contribution to the JEPA family is demonstrating that the predict-in-representation-space paradigm transfers effectively to sequential geospatial data. This is non-trivial for several reasons: (1) trajectories are 1D sequences rather than 2D grids, requiring new masking strategies; (2) GPS data is inherently noisy, requiring domain-specific preprocessing (AdjFuse); (3) the downstream task (similarity computation) differs fundamentally from the classification/detection tasks targeted by I-JEPA; (4) the input representation pipeline (grid discretization → node2vec → AdjFuse) has no analogue in the image domain. T-JEPA thus establishes JEPA as a general-purpose self-supervised framework that extends beyond computer vision, opening the door to JEPA variants for other sequential spatial data modalities.

Influence and Position

Within the JEPA family tree, T-JEPA occupies a distinctive position as one of the first non-vision JEPA variants. While other extensions (V-JEPA, Audio-JEPA, Point-JEPA) adapt JEPA to different data modalities, T-JEPA is notable for targeting a task (metric learning / similarity computation) rather than a representation learning objective for downstream classification. This task-oriented perspective suggests a broader design space for JEPA applications: the framework's strength lies not just in learning transferable features but in learning structured representations whose geometry directly encodes task-relevant relationships.

T-JEPA also highlights a practical advantage of the JEPA framework for applied domains: the elimination of augmentation engineering. In trajectory learning, designing good augmentations is particularly challenging because the space of semantically-preserving trajectory transformations depends on map topology, transportation mode, and sampling characteristics that vary across datasets. T-JEPA sidesteps this entirely, demonstrating that JEPA's masking-based pretext task provides a more universal self-supervised signal.

11. Summary

Key Takeaway: T-JEPA demonstrates that the Joint-Embedding Predictive Architecture, originally designed for image representation learning, can be effectively adapted to GPS trajectory similarity computation. By predicting the representations of masked trajectory segments rather than reconstructing raw coordinates, T-JEPA learns spatially coherent embeddings that capture trajectory structure without requiring handcrafted augmentations.

Main Contributions:
  • Augmentation-free trajectory SSL. T-JEPA replaces the manual augmentation pipelines of contrastive methods (point dropping, distortion, detour injection) with a simple masking-based pretext task, yielding superior performance with less engineering effort.
  • AdjFuse for GPS noise robustness. The spatial neighbor aggregation module provides built-in tolerance to GPS positioning errors by smoothing cell representations over local neighborhoods.
  • node2vec cell embeddings. Representing grid cells via graph embeddings that capture spatial topology significantly improves representation quality over index-based alternatives.
  • Representation-space prediction. The JEPA objective of predicting in latent space rather than coordinate space avoids learning noise patterns and focuses the model on high-level trajectory semantics.
  • State-of-the-art results. T-JEPA achieves the best trajectory similarity performance across Porto, T-Drive, GeoLife, and Foursquare datasets, with particular strength under down-sampling and distortion conditions.
Significance for the JEPA family: T-JEPA extends the JEPA framework beyond vision to sequential geospatial data, demonstrating the generality of latent prediction as a self-supervised learning principle. It establishes that JEPA's core design—EMA target encoder, lightweight predictor, representation-space loss—is robust to fundamental changes in data modality, input structure, and downstream task.

12. References

  1. Li, J., Xue, H., Song, X., & Salim, F. D. (2024). T-JEPA: A Joint-Embedding Predictive Architecture for Trajectory Similarity Computation. arXiv preprint arXiv:2406.12913.
  2. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
  3. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview preprint.
  4. Li, X., Zhao, K., Cong, G., Jensen, C. S., & Wei, W. (2018). Deep Representation Learning for Trajectory Similarity Computation. ICDE 2018. (t2vec)
  5. Chang, Y., Qi, J., Zhao, K., & Cong, G. (2024). TrajCL: Trajectory Contrastive Learning for Trajectory Similarity Computation. ICDE 2024.
  6. Grover, A., & Leskovec, J. (2016). node2vec: Scalable Feature Learning for Networks. KDD 2016.
  7. Zhang, H., Zhang, X., Jiang, Q., Zheng, B., Sun, Z., Sun, W., & Wang, C. (2020). Trajectory Similarity Learning with Auxiliary Supervision and Optimal Matching. IJCAI 2020. (Traj2SimVec)
  8. Yao, D., Gong, H., Zhu, C., Huang, J., & Bi, J. (2022). TrajGAT: A Graph-Based Long-Term Dependency Modeling Approach for Trajectory Similarity Computation. KDD 2022.
  9. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020. (BYOL)
  10. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.
  11. Yuan, J., Zheng, Y., Xie, X., & Sun, G. (2013). T-Drive: Enhancing Driving Directions with Taxi Drivers' Intelligence. IEEE TKDE.
  12. Zheng, Y., Xie, X., & Ma, W.-Y. (2010). GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory. IEEE Data Engineering Bulletin.