At a glance
Problem: Foundation-style forecasting wants zero/few-shot in-context behavior, but raw-value forecasting is noise-prone and prior-fitted networks lack a structured latent space.
Key idea: Learn a JEPA latent space for time series, then run a prior-fitted network for amortized Bayesian in-context forecasting in that latent domain.
Modality: Time series
Target / masking: Temporal windows; the predictor matches future/masked-segment latents under an EMA target.
Builds on: JEPA latent prediction; prior-fitted networks (PFNs).
Used for: Zero/few-shot in-context time-series forecasting; latent foundation models.

Motivation

Foundation-style forecasting aims for zero- and few-shot, in-context behavior across diverse series, but two obstacles stand in the way. Forecasting over raw values is noise-prone, and the model wastes effort tracking high-frequency fluctuation rather than structure. Separately, prior-fitted networks (PFNs) — transformers trained on synthetic data to perform amortized Bayesian inference — typically operate without a structured latent space, predicting directly over observations. LaT-PFN unifies the two: a JEPA supplies the structured latent space, and a PFN supplies amortized in-context inference within it.

How it works

[Figure: JEPA schematic for time series. Windowed input → context encoder $f_\theta$ and EMA target encoder $\bar f_\theta$ → predictor $g_\phi$; latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$, plus a local loss (e.g. MLM).]
Canonical JEPA schematic for time series. The input is split into a visible context and hidden targets (window-level blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimises the latent distance between the two. A local/generative loss runs alongside latent prediction (hybrid objective).
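As a rough illustration of the schematic, the sketch below computes the latent loss for one series and then applies an EMA update to the target weights. Everything in it is an assumption made for illustration and is not taken from LaT-PFN: the encoders are single linear maps, the "predictor" is plain mean pooling of the context latents, and the names `make_windows`, `encode`, `mean_latent`, `latent_loss`, `ema_update` and the momentum `tau` are hypothetical.

```python
import random

def make_windows(series, width):
    """Split a series into non-overlapping temporal windows of equal width."""
    n = len(series) // width
    return [series[i * width:(i + 1) * width] for i in range(n)]

def encode(weights, window):
    """Toy linear encoder: one latent coordinate per weight row."""
    return [sum(w * x for w, x in zip(row, window)) for row in weights]

def mean_latent(latents):
    """Average a list of latent vectors coordinate-wise."""
    return [sum(col) / len(latents) for col in zip(*latents)]

def latent_loss(pred, target):
    """Squared L2 distance; `target` is treated as a constant (stop-gradient)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def ema_update(target_w, online_w, tau=0.99):
    """Move target weights toward online weights: tau * target + (1 - tau) * online."""
    return [[tau * t + (1 - tau) * o for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target_w, online_w)]

random.seed(0)
width, dim = 4, 3
series = [random.gauss(0.0, 1.0) for _ in range(16)]
windows = make_windows(series, width)
context, future = windows[:-1], windows[-1]   # the last window is the hidden target

online_w = [[random.gauss(0.0, 0.5) for _ in range(width)] for _ in range(dim)]
target_w = [row[:] for row in online_w]       # target encoder starts as a copy

z_ctx = mean_latent([encode(online_w, w) for w in context])  # toy predictor: pooled context
z_bar = encode(target_w, future)                             # target latent, no gradient
loss = latent_loss(z_ctx, z_bar)

# Pretend an optimizer step changed the online weights, then refresh the target.
online_w = [[w + 0.1 for w in row] for row in online_w]
target_w_new = ema_update(target_w, online_w, tau=0.9)
```

In a real implementation the online encoder and predictor would be updated by backpropagating `loss`, while the target weights only ever change through the EMA step.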

LaT-PFN has two coupled components.

  • A JEPA-style stage learns a joint embedding for time series: a context encoder $f_\theta$ and EMA target encoder $\bar f_\theta$, with a predictor $g_\phi$ matching future/masked-segment latents under a representation-space loss. The masking unit is a temporal window.
  • On top of this latent space sits a PFN: a transformer trained on synthetic series drawn from a prior to perform Bayesian-style in-context prediction — now forecasting in the learned latent domain rather than over raw values.

A summary or context set of related series conditions the in-context prediction, so the PFN treats those neighbors as an implicit prior and amortizes inference at forecast time.
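The idea of training on synthetic series drawn from a prior, with related series forming a context set, can be sketched as follows. The prior used here (a linear trend plus one sinusoid plus Gaussian noise), the parameter ranges, and the names `sample_params`, `sample_series` and `sample_task` are all assumptions chosen for illustration; the text above does not specify the prior LaT-PFN actually uses. Series that share the same generative parameters stand in for the "related series".

```python
import math
import random

def sample_params(rng):
    """Draw hypothetical generative parameters from a simple prior."""
    return {
        "slope": rng.uniform(-0.05, 0.05),
        "amplitude": rng.uniform(0.5, 2.0),
        "period": rng.choice([7, 12, 24]),
        "noise": rng.uniform(0.05, 0.3),
    }

def sample_series(params, length, rng):
    """Generate one noisy series: trend + seasonality + Gaussian noise."""
    return [
        params["slope"] * t
        + params["amplitude"] * math.sin(2 * math.pi * t / params["period"])
        + rng.gauss(0.0, params["noise"])
        for t in range(length)
    ]

def sample_task(rng, n_context=3, length=48, horizon=12):
    """One training task: a context set of related series plus a held-out query."""
    params = sample_params(rng)  # shared parameters make the series "related"
    context_set = [sample_series(params, length, rng) for _ in range(n_context)]
    query = sample_series(params, length, rng)
    history, future = query[:-horizon], query[-horizon:]
    return context_set, history, future

rng = random.Random(0)
context_set, history, future = sample_task(rng)
```

In a PFN-style setup, many such tasks would be sampled, and the network would be trained to predict `future` (here, its latent embedding) from `history` and `context_set`.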

The objective

The JEPA stage minimizes a latent regression over masked/future windows,

$$\mathcal{L}_{\text{JEPA}} = \big\lVert\, g_\phi(z_{\text{ctx}}, m) - \operatorname{sg}\big[\bar f_\theta(x)_{\text{future}}\big]\,\big\rVert_2^2,$$

where $\operatorname{sg}$ denotes the stop-gradient, $z_{\text{ctx}}$ is the context-encoder output, $m$ marks the windows to be predicted, and the target encoder $\bar f_\theta$ is updated as an EMA of $f_\theta$. The PFN is trained with the standard prior-fitted objective, the predictive likelihood over synthetic series sampled from a prior, but applied in the learned latent space, so it performs amortized Bayesian in-context forecasting on JEPA embeddings rather than on raw observations.
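For concreteness, the EMA target update and the prior-fitted objective can be written out as below. The notation is an assumption layered on the text rather than taken from the source: $\bar\theta$ denotes the target-encoder parameters, $\tau \in [0, 1)$ is an EMA momentum, $q_\psi$ is the PFN's predictive distribution, $\mathcal{C}$ is the context set of related series, $p(\mathcal{D})$ is the synthetic prior over tasks, and $\bar z_{\text{future}} = \bar f_\theta(x)_{\text{future}}$ is the target latent. Using the EMA latent as the PFN's prediction target is itself an assumption of this sketch.

$$\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta, \qquad \mathcal{L}_{\text{PFN}} = \mathbb{E}_{(\mathcal{C},\,x)\,\sim\,p(\mathcal{D})}\Big[-\log q_\psi\big(\bar z_{\text{future}} \mid z_{\text{ctx}},\, \mathcal{C}\big)\Big].$$

Minimizing $\mathcal{L}_{\text{PFN}}$ over many sampled tasks is what makes the inference amortized: at forecast time a single forward pass conditions on $z_{\text{ctx}}$ and $\mathcal{C}$, with no per-series fitting.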

Key results & what's novel

LaT-PFN shows that JEPA representations and prior-fitted in-context learning are complementary. Forecasting in a JEPA-learned latent space denoises the target — the embedding captures abstract temporal structure instead of raw fluctuation — while the PFN supplies amortized in-context inference that enables zero-shot adaptation, turning related series into an implicit prior. The result is an efficient latent foundation model for time-series forecasting, and a template for combining self-supervised embeddings with amortized Bayesian inference that generalizes beyond the specific architecture.

Strengths & limitations

  • + Denoised forecasting by predicting in a structured latent space.
  • + Zero/few-shot in-context adaptation via the PFN's amortized Bayesian inference.
  • + A reusable template uniting self-supervised embeddings with prior-fitted networks.
  • - PFN quality depends on the synthetic prior matching real series; prior misspecification hurts.
  • - Two-stage design (JEPA + PFN) is more complex than end-to-end forecasters.
  • - Latent forecasting can obscure interpretability of the value-space prediction.

Connections & references

Builds on: I-JEPA