At a glance
Problem: Anomalies occur across scales (brief spikes vs. slow drifts), so a single resolution misses some, and reconstruction-based detectors are noise-sensitive.
Key idea: Multi-resolution latent forecasting: flag anomalies where the latent prediction departs from observed embeddings at the appropriate scale.
Modality: Time series (anomaly prediction)
Target / masking: Temporal blocks at multiple resolutions; targets are EMA latents of masked/future segments.
Builds on: JEPA latent prediction; temporal JEPA (je-temporal).
Used for: Time-series anomaly detection and prediction; operational monitoring.

Motivation

Anomalies in time series occur across very different scales: brief spikes that span a few samples versus slow drifts that unfold over long horizons. A detector operating at a single temporal resolution will miss either the fine or the coarse irregularities. Meanwhile, reconstruction-based detectors are sensitive to noise, flagging benign high-frequency fluctuation as anomalous. MTS-JEPA targets multi-resolution anomaly prediction in latent space, combining the noise-robustness of latent targets with sensitivity to anomalies at any temporal extent.

How it works

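[Figure: time-series windows (block masking) feed a context encoder $f_\theta$ and an EMA target encoder $\bar f_\theta$; the predictor $g_\phi$ maps $z_{\text{ctx}}$ to $\hat z$, trained with the latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.]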
Canonical JEPA schematic for time series. The input is split into a visible context and hidden targets (window-level blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimizes the latent distance between the two.

The series is encoded at multiple temporal resolutions; at each scale, temporal blocks form the masking unit.
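The multi-resolution blocking described above can be sketched in a few lines. This is a minimal illustration, not the method's actual preprocessing: the average-pooling downsampler, the resolution factors `(1, 2, 4)`, the block length, and the choice of the last block as the masked/future target are all assumptions made for the example, and the function names are hypothetical.

```python
from statistics import fmean

def downsample(series, factor):
    """Average-pool a 1-D series by `factor` (any trailing remainder is dropped)."""
    n = len(series) // factor
    return [fmean(series[i * factor:(i + 1) * factor]) for i in range(n)]

def split_blocks(series, block_len):
    """Cut a series into non-overlapping temporal blocks of `block_len` samples."""
    n = len(series) // block_len
    return [series[i * block_len:(i + 1) * block_len] for i in range(n)]

def multires_blocks(series, factors=(1, 2, 4), block_len=4):
    """Map each resolution factor to (context_blocks, target_block).

    Assumption for illustration: the last block at each scale plays the
    role of the masked/future target; earlier blocks are the visible context.
    """
    views = {}
    for r in factors:
        blocks = split_blocks(downsample(series, r), block_len)
        views[r] = (blocks[:-1], blocks[-1])
    return views

series = [float(t) for t in range(32)]
views = multires_blocks(series)
for r, (context, target) in views.items():
    print(r, len(context), target)
```

At the coarsest factor each target sample summarizes four raw samples, so a slow drift shifts the coarse target visibly while a one-sample spike is mostly averaged away; the finest scale behaves the opposite way.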

  • A context encoder $f_\theta$ embeds observed windows.
  • A predictor $g_\phi$ regresses the latents of masked/future segments.
  • An EMA target encoder $\bar f_\theta$ provides those target latents, with gradients stopped.

At inference, anomalies are flagged where the latent prediction departs from the observed embeddings — a large prediction error in representation space signals abnormal dynamics. The per-scale branches are combined so that deviations at any resolution contribute to the anomaly signal, catching both spikes and drifts.
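A sketch of how per-scale prediction errors might be turned into one anomaly signal. The text does not say how the branches are fused, so the rule used here (z-score each scale against errors measured on normal data, then take the maximum) is an assumption chosen for illustration, as are the scale names and all function names.

```python
from statistics import fmean, pstdev

def latent_error(pred, target):
    """Squared L2 distance between predicted and observed latent vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def fit_calibration(normal_errors):
    """Per-scale (mean, std) of prediction errors measured on normal data."""
    return {r: (fmean(e), pstdev(e) or 1.0) for r, e in normal_errors.items()}

def anomaly_score(errors, calib):
    """Assumed fusion rule: z-score each scale, return the largest and its scale."""
    z = {r: (errors[r] - calib[r][0]) / calib[r][1] for r in errors}
    worst = max(z, key=z.get)
    return z[worst], worst

# Hypothetical errors collected on normal data at two resolutions.
calib = fit_calibration({
    "fine": [0.9, 1.0, 1.1, 1.0],
    "coarse": [0.18, 0.20, 0.22, 0.20],
})

# A spike: only the fine-scale prediction is far from the observed latent.
spike_errors = {"fine": latent_error([0.0, 0.0], [2.0, 1.0]), "coarse": 0.21}
spike_score, spike_scale = anomaly_score(spike_errors, calib)

# A drift: only the coarse-scale error is elevated.
drift_errors = {"fine": 1.0, "coarse": 0.5}
drift_score, drift_scale = anomaly_score(drift_errors, calib)

print(spike_scale, drift_scale)
```

Normalizing per scale before fusing keeps a branch with naturally larger errors from dominating the score; the threshold applied to the fused value still has to be calibrated separately.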

The objective

At each resolution $r$, training minimizes the latent regression over masked/future segments:

$$\mathcal{L} = \sum_{r} \big\lVert\, g_\phi^{(r)}(z_{\text{ctx}}^{(r)}, m) - \operatorname{sg}\big[\bar f_\theta^{(r)}(x)_{\text{future}}\big]\,\big\rVert_2^2,$$

where $\operatorname{sg}$ is the stop-gradient operator, $\bar f_\theta^{(r)}$ is the EMA target encoder at resolution $r$, and $m$ indicates which masked/future segments are to be predicted. At test time the same per-resolution prediction error becomes the anomaly score: where the predicted latent diverges from the observed embedding at the appropriate scale, the segment is flagged. No raw-value reconstruction is used, so the score inherits the noise-robustness of latent targets.
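The two ingredients of the objective, the summed per-resolution latent loss and the EMA target update, can be written out as a small sketch. It uses plain Python lists in place of tensors; the decay value and the function names are assumptions for illustration, and in an autodiff framework the targets would be detached to realize the stop-gradient.

```python
def multires_loss(preds, targets):
    """Sum over resolutions of the squared L2 distance between latents.

    `targets` are treated as constants (the stop-gradient): no gradient
    would flow into the target encoder through this loss.
    """
    return sum(
        sum((p - t) ** 2 for p, t in zip(preds[r], targets[r]))
        for r in preds
    )

def ema_update(target_params, online_params, decay=0.99):
    """Move target-encoder weights toward the context-encoder weights.

    `decay` is a hypothetical value; the source does not state one.
    """
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]

# Toy latents at two resolutions (keys are resolution factors).
preds = {1: [1.0, 2.0], 4: [0.0, 0.0]}
targets = {1: [1.0, 0.0], 4: [3.0, 4.0]}
loss = multires_loss(preds, targets)

# One EMA step with an exaggerated decay so the effect is easy to see.
new_target = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.5)
print(loss, new_target)
```

Only the context encoder and predictor receive gradients from the loss; the target encoder changes solely through the EMA step.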

Key results & what's novel

MTS-JEPA applies the JEPA recipe to a concrete monitoring problem by casting anomaly prediction as multi-scale latent forecasting: a large prediction error in embedding space at the appropriate resolution indicates abnormal dynamics. This leverages JEPA's noise-robust latent targets, while the multi-resolution design captures anomalies of differing temporal extent, spikes and drifts alike. The result is a reconstruction-free approach to anomaly detection and prediction that extends the JEPA family toward operational time-series analytics rather than pure representation learning.

Strengths & limitations

  • + Multi-resolution design catches both short spikes and slow drifts.
  • + Reconstruction-free, noise-robust anomaly scoring via latent prediction error.
  • + Turns the JEPA objective directly into an operational monitoring signal.
  • − The choice and number of resolutions, and how the branches are fused, add design complexity.
  • − Latent prediction error is an indirect anomaly proxy; thresholding and calibration remain non-trivial.
  • − Like any detector trained on normal data, performance depends on how well the training set covers normal behavior.

Connections & references