Motivation
Anomalies in time series occur at very different scales: brief spikes span a few samples, while slow drifts unfold over long horizons. A detector operating at a single temporal resolution tends to miss one or the other. Meanwhile, reconstruction-based detectors are sensitive to noise and can flag benign high-frequency fluctuations as anomalous. MTS-JEPA targets multi-resolution anomaly prediction in latent space, combining the noise robustness of latent targets with sensitivity to anomalies of differing temporal extent.
How it works
The series is encoded at multiple temporal resolutions; at each scale, temporal blocks form the masking unit.
- A context encoder $f_\theta$ embeds observed windows.
- A predictor $g_\phi$ regresses the latents of masked/future segments.
- An EMA target encoder $\bar f_\theta$ provides those target latents, with gradients stopped.
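A minimal NumPy sketch of the three ingredients named above: multi-resolution views, block-wise masking, and the EMA target update. The average-pooling downsampler, the factors `(1, 4, 16)`, the aligned-block sampling scheme, and the decay `tau` are all assumptions made for illustration; the summary does not say how MTS-JEPA builds its resolutions or samples its masks.

```python
import numpy as np

def multi_resolution_views(x, factors=(1, 4, 16)):
    """Average-pool a (T, C) series by each factor (hypothetical choice of downsampler).

    Returns {factor: array of shape (T // factor, C)}.
    """
    views = {}
    for f in factors:
        n = (x.shape[0] // f) * f  # drop the ragged tail so the reshape is exact
        views[f] = x[:n].reshape(n // f, f, x.shape[1]).mean(axis=1)
    return views

def block_mask(length, block, n_blocks, rng):
    """Boolean mask of shape (length,), True on n_blocks aligned, non-overlapping blocks."""
    mask = np.zeros(length, dtype=bool)
    starts = rng.choice(length // block, size=n_blocks, replace=False) * block
    for s in starts:
        mask[s:s + block] = True
    return mask

def ema_update(target_params, online_params, tau=0.99):
    """Move target-encoder weights toward the context-encoder weights.

    The target encoder is only ever updated this way, never by gradient descent.
    """
    return {k: tau * target_params[k] + (1.0 - tau) * online_params[k]
            for k in target_params}
```

In this sketch each resolution would get its own mask, with the context encoder seeing the unmasked steps and the EMA encoder supplying latents for the masked ones.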
At inference, anomalies are flagged where the latent prediction departs from the observed embeddings — a large prediction error in representation space signals abnormal dynamics. The per-scale branches are combined so that deviations at any resolution contribute to the anomaly signal, catching both spikes and drifts.
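One way the per-scale errors might be combined is sketched below. The fusion rule is an assumption: each scale's error is standardized with statistics from normal data, upsampled to the base resolution, and the maximum across scales is kept. The summary does not specify how MTS-JEPA fuses its branches, and the names `per_step_error`, `fuse_scores`, and `ref_stats` are hypothetical.

```python
import numpy as np

def per_step_error(pred, target):
    """Squared L2 distance between predicted and observed latents, per step.

    pred, target: arrays of shape (T_r, D). Returns shape (T_r,).
    """
    return np.sum((pred - target) ** 2, axis=-1)

def fuse_scores(errors, ref_stats, length):
    """Fuse per-scale errors into one score per base-resolution step.

    errors:    {factor: (T_r,) error array at that downsampling factor}
    ref_stats: {factor: (mean, std)} of errors on normal data (assumed available)
    length:    number of steps at the base resolution
    """
    fused = np.full(length, -np.inf)
    for factor, err in errors.items():
        mean, std = ref_stats[factor]
        z = (err - mean) / (std + 1e-8)           # make scales comparable
        up = np.repeat(z, factor)[:length]        # back to base resolution
        up = np.pad(up, (0, length - up.shape[0]), mode="edge")  # cover a ragged tail
        fused = np.maximum(fused, up)             # max rule: any scale can raise the score
    return fused
```

With a max rule, a spike that only the fine scale sees and a drift that only the coarse scale sees both raise the fused score. A weighted sum would be an equally plausible alternative.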
The objective
At each resolution $r$, training minimizes the latent regression error over masked/future segments, and the per-resolution terms are summed:
$$\mathcal{L} = \sum_{r} \big\lVert\, g_\phi^{(r)}(z_{\text{ctx}}^{(r)}, m) - \operatorname{sg}\big[\bar f_\theta^{(r)}(x)_{\text{future}}\big]\,\big\rVert_2^2,$$
where $\operatorname{sg}$ denotes the stop-gradient operator and the targets come from the EMA encoder $\bar f_\theta^{(r)}$. At test time the same per-resolution prediction error serves as the anomaly score: a segment is flagged where the predicted latent diverges from the observed embedding at the corresponding scale. Because no raw-value reconstruction is involved, the score inherits the noise robustness of latent targets.
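Written out, the test-time score at resolution $r$ is the summand of the training objective, evaluated without the stop-gradient since no gradients are taken at inference. The score symbol $s^{(r)}$ and the per-resolution threshold $\tau^{(r)}$ are notation assumed here for illustration; they do not appear in the original formulation.
$$s^{(r)}(x) = \big\lVert\, g_\phi^{(r)}(z_{\text{ctx}}^{(r)}, m) - \bar f_\theta^{(r)}(x)_{\text{future}}\,\big\rVert_2^2, \qquad \text{flag at resolution } r \text{ if } s^{(r)}(x) > \tau^{(r)}.$$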
Key results & what's novel
MTS-JEPA applies the JEPA recipe to a concrete monitoring problem by casting anomaly prediction as multi-scale latent forecasting: large prediction error in embedding space at the appropriate resolution indicates abnormal dynamics. This leverages JEPA's noise-robust latent targets, while the multi-resolution design captures anomalies of differing temporal extent, spikes and drifts alike. The result is a reconstruction-free approach to anomaly detection and prediction that extends the JEPA family toward operational time-series analytics rather than pure representation learning.
Strengths & limitations
- + Multi-resolution design catches both short spikes and slow drifts.
- + Reconstruction-free, noise-robust anomaly scoring via latent prediction error.
- + Turns the JEPA objective directly into an operational monitoring signal.
- − Choice and number of resolutions, and how branches are fused, add design complexity.
- − Latent prediction error is an indirect anomaly proxy; thresholding and calibration remain non-trivial.
- − Like all such detectors, performance depends on the normal-data distribution seen in training.
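To make the thresholding limitation concrete, here is a minimal sketch of one common calibration heuristic: choose the threshold as a high quantile of scores on held-out normal data, so that roughly a target fraction of normal segments is flagged. This is a generic technique, not something the summary attributes to MTS-JEPA, and `calibrate_threshold`, `flag`, and `target_fpr` are hypothetical names.

```python
import numpy as np

def calibrate_threshold(normal_scores, target_fpr=0.01):
    """Threshold that roughly target_fpr of held-out normal scores exceed."""
    return float(np.quantile(normal_scores, 1.0 - target_fpr))

def flag(scores, threshold):
    """Boolean array: True where the anomaly score exceeds the threshold."""
    return np.asarray(scores) > threshold
```

A threshold chosen this way is only as good as the held-out normal data: if the normal distribution shifts after deployment, the false-positive rate shifts with it, which is the dependence noted in the last limitation.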