At a glance
Problem: Standard JEPA predicts a single point estimate of the target embedding and handles collapse with ad hoc architectural tricks.
Key idea: Recast JEPA as a variational latent-variable model: the predictor outputs a distribution over target latents, derived from a principled bound.
Modality: Theory (probabilistic JEPA)
Target / masking: Context-to-target latent prediction reinterpreted as a variational posterior/predictive distribution.
Builds on: JEPA's latent-prediction objective and variational latent-variable modelling.
Used for: Uncertainty-aware latent world models and risk-sensitive planning.

Motivation

Standard JEPA minimises a regression distance between a predicted embedding and an EMA target embedding, and relies on architectural devices (the EMA target encoder and stop-gradient) to avoid representational collapse. But the world is stochastic and partially observed, and a point prediction cannot express uncertainty about the future. Var-JEPA (Gögl et al., 2026) reformulates JEPA in a variational framework: it treats the deterministic objective as a special case of a probabilistic latent-variable model and derives the loss from a principled bound, so that collapse avoidance and uncertainty estimation both follow from the same probabilistic foundation.

How it works

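[Figure: input pair (tokens, blocks) → context encoder $f_\theta$ and target encoder $\bar f_\theta$ (EMA copy) → predictor $g_\phi$ → latent loss $\lVert \hat z - \mathrm{sg}(\bar z) \rVert^2$, with $z_{\text{ctx}}$ the context embedding and $\bar z$ the stop-gradient target embedding.]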
Canonical JEPA schematic. The input is split into a visible context and hidden targets (token-level, blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to a prediction $\hat z$ of the target embeddings; training minimises the latent distance between $\hat z$ and the stop-gradient target $\mathrm{sg}(\bar z)$.
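The deterministic pipeline in the schematic can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any particular JEPA paper: the linear `encode` function, the weight matrices, the dimensions, and the EMA rate `tau` are all hypothetical stand-ins, and no gradient step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy encoder standing in for f_theta (hypothetical: one linear layer + tanh)."""
    return np.tanh(x @ W)

def ema_update(target_W, online_W, tau=0.99):
    """Target weights track the online weights as an exponential moving average."""
    return tau * target_W + (1.0 - tau) * online_W

def jepa_loss(z_hat, z_bar):
    """Squared latent distance ||z_hat - sg(z_bar)||^2, averaged over the batch.
    z_bar is treated as a constant here, which is what stop-gradient means."""
    return np.mean(np.sum((z_hat - z_bar) ** 2, axis=-1))

d_in, d_lat, batch = 8, 4, 16                # illustrative sizes
W_ctx = rng.normal(size=(d_in, d_lat))       # context encoder weights
W_tgt = W_ctx.copy()                         # target encoder starts as a copy
W_pred = rng.normal(size=(d_lat, d_lat))     # predictor g_phi (linear here)

x_ctx = rng.normal(size=(batch, d_in))       # visible context
x_tgt = rng.normal(size=(batch, d_in))       # hidden targets

z_ctx = encode(x_ctx, W_ctx)                 # context embedding
z_bar = encode(x_tgt, W_tgt)                 # target embedding (no gradient)
z_hat = z_ctx @ W_pred                       # point prediction of the target
loss = jepa_loss(z_hat, z_bar)

W_tgt = ema_update(W_tgt, W_ctx)             # applied after each optimiser step
```

Note that the prediction `z_hat` is a single vector per example, which is exactly the limitation the variational reformulation addresses.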

Var-JEPA introduces explicit distributions over latents. The context encoder and predictor define a variational predictive distribution over target representations rather than a single vector. Training maximises a variational objective: a likelihood-style term encourages the predicted distribution to assign high probability to the observed target embedding, while a regularisation term (entropy or prior matching) plays the anti-collapse role. The predictor thus produces a distribution over target latents, naturally capturing predictive uncertainty, and collapse avoidance becomes a consequence of the variational regulariser rather than of stop-gradient heuristics.
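One common way to make a predictor output a distribution rather than a vector is a diagonal-Gaussian head. The sketch below assumes that choice; the source does not specify the distribution family, so the Gaussian form, the function names, and the weight shapes are illustrative assumptions only.

```python
import numpy as np

def gaussian_head(z_ctx, W_mu, W_logvar):
    """Map a context embedding to the mean and log-variance of a
    diagonal Gaussian over the target latent (assumed family)."""
    mu = z_ctx @ W_mu
    logvar = z_ctx @ W_logvar
    return mu, logvar

def sample_latent(mu, logvar, rng):
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def gaussian_log_prob(z, mu, logvar):
    """Log-density of z under N(mu, diag(exp(logvar))), summed over dimensions."""
    return -0.5 * np.sum(
        logvar + (z - mu) ** 2 / np.exp(logvar) + np.log(2.0 * np.pi), axis=-1
    )

rng = np.random.default_rng(0)
z_ctx = rng.standard_normal((5, 4))               # toy context embeddings
W_mu = rng.standard_normal((4, 3))                # hypothetical head weights
W_logvar = 0.1 * rng.standard_normal((4, 3))

mu, logvar = gaussian_head(z_ctx, W_mu, W_logvar)
z_sample = sample_latent(mu, logvar, rng)         # one draw per example
sigma = np.exp(0.5 * logvar)                      # per-dimension predictive std
```

Here `sigma` is the quantity a point-estimate predictor cannot provide: a per-dimension measure of how uncertain the model is about each target latent.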

The objective

Training maximises a variational lower bound on the latent predictive likelihood:

$$\mathcal{L} = \mathbb{E}_{q_\phi(z\mid x_{\text{ctx}})}\big[\log p_\theta(z_{\text{tgt}}\mid z)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z\mid x_{\text{ctx}})\,\Vert\,p(z)\big).$$

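The first term rewards predictive distributions that assign high likelihood to the target embedding $z_{\text{tgt}}$; the KL term, weighted by $\beta$, pulls $q_\phi$ toward a prior $p(z)$, which regularises the latent space and prevents collapse. If the predictive distribution is collapsed to a Dirac (a point prediction) and the KL term is dropped ($\beta = 0$), the objective reduces to the ordinary JEPA regression loss; for instance, under a fixed-variance Gaussian likelihood the log-likelihood is a negative squared error up to a constant. Standard JEPA is therefore a limiting case.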
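The bound and its deterministic limit can be checked numerically. The sketch below is written under assumptions not stated in the source: $q_\phi$ is a diagonal Gaussian with parameters `mu` and `logvar`, the prior $p(z)$ is a standard normal, and $p_\theta(z_{\text{tgt}}\mid z)$ is an isotropic Gaussian centred on $z$ with fixed variance `sigma2`. The expectation is estimated by Monte Carlo sampling.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), one value per example."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def neg_bound(z_tgt, mu, logvar, beta=1.0, sigma2=1.0, n_samples=1, rng=None):
    """Negative of the variational objective, averaged over the batch.

    Assumes p_theta(z_tgt | z) = N(z_tgt; z, sigma2 * I), an illustrative choice.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = mu.shape[-1]
    log_lik = np.zeros(mu.shape[:-1])
    for _ in range(n_samples):
        # Reparameterised draw from q_phi(z | x_ctx).
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        sq_err = np.sum((z_tgt - z) ** 2, axis=-1)
        log_lik += -0.5 * (sq_err / sigma2 + d * np.log(2.0 * np.pi * sigma2))
    log_lik /= n_samples
    bound = log_lik - beta * kl_to_standard_normal(mu, logvar)
    return -np.mean(bound)

rng = np.random.default_rng(1)
z_tgt = rng.standard_normal((6, 3))
mu = rng.standard_normal((6, 3))

# Deterministic limit: near-zero variance in q and beta = 0.
point_like = neg_bound(z_tgt, mu, np.full((6, 3), -40.0), beta=0.0)
half_mse = 0.5 * np.mean(np.sum((z_tgt - mu) ** 2, axis=-1))
constant = 0.5 * 3 * np.log(2.0 * np.pi)
```

With these settings `point_like` agrees with `half_mse + constant` to numerical precision, so minimising the negative bound is the same as minimising the squared latent distance, matching the limiting-case claim above.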

Why it matters

Real environments are stochastic and partially observed, so a world model should represent a distribution over future latent states, not a single prediction. Var-JEPA's probabilistic formulation provides a principled route to uncertainty-aware latent world models, enabling risk-sensitive planning and a cleaner theoretical understanding of what the JEPA objective is implicitly estimating. It also connects JEPA to the broad literature on variational autoencoders and latent-variable modelling.

Strengths & limitations

  • + Captures predictive uncertainty rather than a point estimate.
  • + Recasts anti-collapse as a principled KL/entropy regulariser.
  • + Subsumes standard JEPA as a deterministic limit.
  • − Introduces a prior and a $\beta$ weight that must be chosen.
  • − Variational training can be harder to optimise and may underfit if the posterior family is too restrictive.

Connections & references

Builds on: I-JEPA, LeJEPA