Motivation
Joint-embedding self-supervised methods, including JEPA, were designed for images, where masking spatial blocks captures the relevant structure and the input has no inherent temporal ordering. Time series are different: they exhibit explicit temporal ordering, autocorrelation, and trends, and treating them as an unordered bag of patches throws away exactly the structure that matters. This work investigates how to make joint-embedding prediction respect time by adapting the masking and the context-target arrangement, so that the learned representation captures genuine temporal dynamics rather than static co-occurrence.
How it works
The method follows the JEPA template, with three components:
- A context encoder $f_\theta$ embeds an observed window of the series.
- An EMA target encoder $\bar f_\theta$ embeds held-out segments to provide targets, with gradients stopped.
- A predictor $g_\phi$ regresses the target latents under a representation-space loss.
The masking unit is a temporal block. The central design contribution is making context-target selection temporally aware, for example by predicting future or interleaved time blocks from past context, so that the predictor must model temporal dependence rather than treat the series as an unordered set of patches. The arrangement along the time axis, not merely the mask ratio, becomes the lever that injects temporal structure.
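As a rough illustration of temporally aware selection, the sketch below splits a series into contiguous blocks and picks context and target blocks under two arrangements. The function names, the `scheme` argument, and the even-spacing rule used for interleaved targets are assumptions made for this example, not details taken from the work.

```python
import numpy as np

def temporal_blocks(n_steps, block_len):
    """Split time indices 0..n_steps-1 into contiguous blocks; a trailing remainder is dropped."""
    n_blocks = n_steps // block_len
    return [np.arange(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

def select_context_targets(n_blocks, n_targets, scheme="future"):
    """Return (context_ids, target_ids) as sorted arrays of block indices."""
    if not 0 < n_targets < n_blocks:
        raise ValueError("need 0 < n_targets < n_blocks")
    ids = np.arange(n_blocks)
    if scheme == "future":
        # Targets are the most recent blocks; all earlier blocks form the context.
        target = ids[-n_targets:]
    elif scheme == "interleaved":
        # Evenly spaced targets (assumed rule); block 0 always stays in the context.
        target = 1 + (np.arange(n_targets) * (n_blocks - 2)) // max(n_targets - 1, 1)
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    context = np.setdiff1d(ids, target)
    return context, target

blocks = temporal_blocks(n_steps=96, block_len=12)  # 8 blocks of 12 steps each
ctx, tgt = select_context_targets(len(blocks), n_targets=3, scheme="interleaved")
print("context blocks:", ctx, "target blocks:", tgt)
```

Both schemes use the same number of target blocks, so the mask ratio is identical. Only the placement along the time axis changes, which is the lever the text describes.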
The objective
For temporally selected target blocks $k = 1, \dots, M$, the loss is a latent $\ell_2$ regression:
$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$
where $z_{\text{ctx}}$ is the context encoder's embedding of the observed window, $m_k$ tells the predictor which target block $k$ to predict (its position along the time axis), $\bar f_\theta(x)_k$ is the target encoder's latent for block $k$, and $\operatorname{sg}$ is the stop-gradient. The target encoder is updated by EMA. The objective is the standard JEPA regression. What differs is that the target blocks $k$ are chosen along the time axis (future or interleaved), so minimizing the loss requires modeling temporal dependence rather than spatial co-occurrence.
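The objective and the EMA update can be sketched numerically as follows. This is a minimal NumPy illustration, not the work's implementation: the function names and the momentum symbol `tau` are hypothetical, and because NumPy has no autodiff the stop-gradient appears only as a comment (in practice it would be something like `detach()` in PyTorch or `jax.lax.stop_gradient` in JAX).

```python
import numpy as np

def jepa_latent_loss(pred, target):
    """Mean over M target blocks of the squared L2 distance between latents.

    pred, target: arrays of shape (M, d). `target` plays the role of
    sg[target_encoder(x)_k]; no gradient would flow through it.
    """
    diff = pred - target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

def ema_update(target_params, online_params, tau=0.99):
    """Return new target-encoder parameters: tau * target + (1 - tau) * online."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

# Toy example with M = 2 target blocks and latent dimension d = 2.
pred = np.array([[1.0, 0.0], [0.0, 2.0]])
target = np.zeros((2, 2))
print("loss:", jepa_latent_loss(pred, target))

online = {"W": np.ones(2)}
target_enc = {"W": np.zeros(2)}
target_enc = ema_update(target_enc, online, tau=0.9)
print("updated target weights:", target_enc["W"])
```

With `tau` close to 1, the target encoder trails the context encoder slowly, which is the usual motivation for pairing an EMA target with a stop-gradient in this family of methods.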
Key results & what's novel
The study clarifies that, for time series, the arrangement of context and target along the time axis, not just the masking ratio, determines whether the encoder learns genuine temporal dynamics. Aligning prediction with temporal structure (predicting future or interleaved blocks) yields representations that capture trends and dependencies that spatial-style masking would miss. By adapting joint-embedding prediction to the sequential nature of time series, the work maps the design space for temporal JEPAs and supplies representations that transfer to forecasting, classification, and anomaly detection, making it a reference point for extending JEPA to time-series self-supervised learning.
Strengths & limitations
- + Identifies temporal context-target arrangement as the key design lever for time-series JEPA.
- + Learns trends and dependencies that spatial-style masking misses.
- + Produces representations transferable across forecasting, classification, and anomaly-detection tasks.
- − More of a design study than a single deployed model; gains depend on the chosen temporal masking scheme.
- − Optimal arrangement likely varies by series type (stationary vs. trending).
- − Does not address multi-resolution structure that anomaly-focused variants target.