Joint Embeddings Go Temporal

At a glance

Problem: Joint-embedding methods were designed for images, where spatial masking captures structure; time series instead have explicit ordering, autocorrelation, and trends.

Key idea: Make joint-embedding prediction temporally aware by arranging context and target along the time axis so the predictor must model temporal dependence.

Modality: Time series

Target / masking: Temporal blocks, selected to be temporally aware (e.g. future or interleaved blocks predicted from past context).

Builds on: I-JEPA / V-JEPA masked latent prediction.

Used for: Time-series forecasting, classification, anomaly representations.

Motivation

Joint-embedding self-supervised methods, including JEPA, were designed for images, where masking spatial blocks captures the relevant structure and the input has no inherent temporal ordering. Time series are different: they exhibit explicit temporal ordering, autocorrelation, and trends, and treating them as an unordered bag of patches throws away exactly the structure that matters. This work investigates how to make joint-embedding prediction respect time, adapting the masking and the context-target arrangement so that the learned representation captures genuine temporal dynamics rather than static co-occurrence.

How it works

Figure: canonical JEPA schematic for time series. The input is split into a visible context and hidden targets (window-level blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between predicted and target embeddings in latent space.

Following the JEPA template:

A context encoder $f_\theta$ embeds an observed window of the series.
An EMA target encoder $\bar f_\theta$ embeds held-out segments to provide targets, with gradients stopped.
A predictor $g_\phi$ regresses the target latents under a representation-space loss.
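The EMA relationship between the context encoder and the target encoder can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Params` dictionary layout, the function name `ema_update`, and the default momentum of 0.99 are all hypothetical choices made for the example.

```python
from typing import Dict, List

# Hypothetical parameter container: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(target: Params, online: Params, momentum: float = 0.99) -> Params:
    """Return new target-encoder parameters as an exponential moving average.

    new_target = momentum * target + (1 - momentum) * online

    The momentum default is an assumption for illustration only.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if target.keys() != online.keys():
        raise ValueError("target and online parameters must share the same names")
    return {
        name: [
            momentum * t + (1.0 - momentum) * o
            for t, o in zip(target[name], online[name])
        ]
        for name in target
    }
```

In a sketch like this the target parameters are only ever written by the averaging step and never by the optimizer, which is the practical meaning of "gradients stopped": the loss updates the context encoder and predictor, and the target encoder follows them slowly.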

The masking unit is a temporal block. The central design contribution is making context-target selection temporally aware — for example predicting future or interleaved time blocks from past context, so the predictor must model temporal dependence rather than treat the series as an unordered set of patches. The arrangement along the time axis, not merely the mask ratio, becomes the lever that injects temporal structure.
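To make the two arrangements named above concrete, the sketch below cuts a series into consecutive temporal blocks and assigns each block to context or target. It is an illustration only: the function names `to_blocks` and `split_blocks`, the scheme labels, and the even-spacing rule used for the interleaved case are assumptions, not the selection procedure used in the paper.

```python
from typing import List, Tuple


def to_blocks(series: List[float], block_len: int) -> List[List[float]]:
    """Cut a series into non-overlapping blocks, dropping any ragged tail."""
    n = len(series) // block_len
    return [series[i * block_len:(i + 1) * block_len] for i in range(n)]


def split_blocks(
    num_blocks: int, scheme: str, num_targets: int
) -> Tuple[List[int], List[int]]:
    """Return (context_block_ids, target_block_ids) in temporal order.

    scheme="future":      the last `num_targets` blocks are targets.
    scheme="interleaved": targets are spread evenly between context blocks.
    Both rules are illustrative assumptions.
    """
    if not 0 < num_targets < num_blocks:
        raise ValueError("need 0 < num_targets < num_blocks")
    if scheme == "future":
        targets = list(range(num_blocks - num_targets, num_blocks))
    elif scheme == "interleaved":
        # Block 0 always stays in the context so every target has a past.
        stride = (num_blocks - 1) / num_targets
        targets = sorted({1 + int(i * stride) for i in range(num_targets)})
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    target_set = set(targets)
    context = [b for b in range(num_blocks) if b not in target_set]
    return context, targets
```

With eight blocks and three targets, the "future" scheme hides the last three blocks, while the "interleaved" scheme hides blocks scattered through the window. The mask ratio is identical in both cases; only the placement along the time axis changes, which is the lever the section describes.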

The objective

For temporally selected target blocks $k = 1, \dots, M$, the loss is a latent $\ell_2$ regression:

$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$

where $z_{\text{ctx}}$ is the context encoder's embedding of the visible window, $m_k$ identifies target block $k$ to the predictor (in I-JEPA, a mask token carrying the block's position), $\bar f_\theta(x)_k$ is the target encoder's embedding of block $k$, and $\operatorname{sg}$ is the stop-gradient; the target encoder is updated by EMA. The objective is the standard JEPA regression. What differs is that the masked blocks $k$ are chosen along the time axis (future or interleaved), so minimizing the loss requires modeling temporal dependence rather than spatial co-occurrence.
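As a worked instance of the loss above, the following sketch computes the squared $\ell_2$ distance between predicted and target latents and averages it over the $M$ target blocks. It assumes the latents are already available as plain lists of floats; the function name `latent_l2_loss` is hypothetical, and the target latents are treated as constants, which stands in for the stop-gradient since plain Python has no automatic differentiation.

```python
from typing import Sequence


def latent_l2_loss(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean over M target blocks of the squared L2 distance in latent space.

    predicted[k] plays the role of g_phi(z_ctx, m_k);
    target[k] plays the role of sg[f_bar_theta(x)_k] and is read-only here.
    """
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("need the same non-zero number of predicted and target blocks")
    total = 0.0
    for pred_k, tgt_k in zip(predicted, target):
        if len(pred_k) != len(tgt_k):
            raise ValueError("latent dimensions must match per block")
        # Squared L2 norm of the difference for block k.
        total += sum((p - t) ** 2 for p, t in zip(pred_k, tgt_k))
    return total / len(predicted)


# Two target blocks with 2-dimensional latents:
# block 0 contributes 1 + 4 = 5, block 1 contributes 9 + 16 = 25.
loss = latent_l2_loss([[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [3.0, 4.0]])
print(loss)  # 15.0
```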

Key results & what's novel

The study's main finding is that, for time series, the arrangement of context and target along the time axis, not just the masking ratio, determines whether the encoder learns genuine temporal dynamics. Aligning prediction with temporal structure (predicting future or interleaved blocks) yields representations that capture trends and dependencies that spatial-style masking would miss. By adapting joint-embedding prediction to the sequential nature of time series, the work maps the design space for temporal JEPAs and supplies representations that transfer to forecasting, classification, and anomaly tasks, which makes it a reference point for JEPA-style self-supervised learning on time series.

Strengths & limitations

+ Identifies temporal context-target arrangement as the key design lever for time-series JEPA.
+ Learns trends and dependencies that spatial-style masking misses.
+ Produces representations transferable across forecasting, classification, and anomaly tasks.
− More of a design study than a single deployed model; gains depend on the chosen temporal masking scheme.
− Optimal arrangement likely varies by series type (stationary vs. trending).
− Does not address multi-resolution structure that anomaly-focused variants target.

Connections & references

Builds on: I-JEPA, V-JEPA
