JEPAs Focus on Slow Features

At a glance

Problem: It was unclear which factors of variation a JEPA chooses to encode when predicting masked or future content in latent space.

Key idea: Latent prediction acts as an implicit slowness prior: JEPAs preferentially capture slowly varying factors, linking them to Slow Feature Analysis.

Modality: Analysis (images / video)

Target / masking: Standard context-to-target latent prediction with an EMA target encoder.

Builds on: Slow Feature Analysis and the non-contrastive collapse-avoidance literature.

Used for: Explaining why JEPA latents discard fast nuisance detail and suit world-model state spaces.

Motivation

A JEPA is trained to predict a target embedding from a context embedding, but the objective never says what the encoder should represent. Empirically, JEPA latents are semantically clean and ignore high-frequency pixel detail, yet the reason had not been pinned down. JEPAs Focus on Slow Features (Sobal et al., 2022) asks this question directly and connects the latent-prediction objective to Slow Feature Analysis (SFA), a classic principle stating that good representations should vary slowly over time or space while still being informative.

How it works

Canonical JEPA schematic for video / image pairs. The input is split into a visible context and hidden targets (masked at the token level, in blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy of $f_\theta$, with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between predicted and actual target embeddings in latent space.
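The data flow in the schematic can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the encoders and predictor are assumed to be single linear maps, and the names `matvec`, `ema_update`, `jepa_loss` and the momentum `tau` are hypothetical choices made for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Apply a linear map (list of rows) to a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def ema_update(target: Matrix, online: Matrix, tau: float) -> Matrix:
    """Target weights track the online weights: target <- tau*target + (1-tau)*online."""
    return [
        [tau * t + (1.0 - tau) * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(target, online)
    ]


def jepa_loss(x_ctx: Vector, x_tgt: Vector,
              enc_w: Matrix, tgt_w: Matrix, pred_w: Matrix) -> float:
    """Squared latent distance between the predicted and the target embedding."""
    z_ctx = matvec(enc_w, x_ctx)   # context encoder f_theta (assumed linear here)
    z_tgt = matvec(tgt_w, x_tgt)   # target encoder, treated as a constant (no gradient)
    z_hat = matvec(pred_w, z_ctx)  # predictor g_phi (assumed linear here)
    return sum((p - t) ** 2 for p, t in zip(z_hat, z_tgt))


identity = [[1.0, 0.0], [0.0, 1.0]]
loss = jepa_loss([1.0, 2.0], [1.0, 3.0], identity, identity, identity)
```

In a real training loop the gradient of this loss would update only the context encoder and the predictor, after which `ema_update` would move the target encoder a small step toward the context encoder.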

The argument is mechanistic. Predicting a target embedding from a context embedding rewards features that are stable across the context-target gap. A feature that changes rapidly or unpredictably between the two cannot be predicted, so it contributes loss and is implicitly suppressed; a feature that is slowly varying is easy to predict and is preferentially retained. The masked- or future-prediction structure therefore behaves like an implicit slowness regulariser on the encoder.
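The predictability gap in this argument can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions rather than an experiment from the paper: each factor is modelled as an AR(1) process (coefficient near 1 for a slow factor, 0 for a fast one), and the predictor is the best linear map from the context value to the value one step later. The helper names `ar1` and `unexplained_fraction` are hypothetical.

```python
import math
import random


def ar1(a: float, n: int, rng: random.Random) -> list:
    """AR(1) series with roughly unit stationary variance; a near 1 is slow, a = 0 is white noise."""
    scale = math.sqrt(1.0 - a * a)
    s, out = 0.0, []
    for _ in range(n):
        s = a * s + scale * rng.gauss(0.0, 1.0)
        out.append(s)
    return out


def unexplained_fraction(series: list, gap: int) -> float:
    """Share of target variance left after the best linear prediction from the context value."""
    x, y = series[:-gap], series[gap:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return 1.0 - (sxy * sxy) / (sxx * syy)  # equals 1 - r^2 for least squares


rng = random.Random(0)
slow_err = unexplained_fraction(ar1(0.99, 5000, rng), gap=1)
fast_err = unexplained_fraction(ar1(0.0, 5000, rng), gap=1)
print(f"slow factor unexplained: {slow_err:.3f}")
print(f"fast factor unexplained: {fast_err:.3f}")
```

The slow factor leaves only a small share of its variance unexplained, while the fast factor leaves almost all of it, so encoding the fast factor mostly adds prediction loss.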

This reframes JEPA's inductive bias: rather than modelling every pixel, the encoder is pushed toward the low-temporal-frequency, persistent components of the signal — exactly the factors SFA was designed to extract — while fast, noisy detail receives little gradient pressure to be encoded.
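For reference, SFA scores a signal by the mean squared change between consecutive samples, normalised by the signal's variance; lower means slower. The sketch below computes that score for a few synthetic signals. The function name `slowness` and the particular sine waves are choices made for this illustration only.

```python
import math


def slowness(y: list) -> float:
    """SFA-style score: mean squared step-to-step change divided by variance (lower = slower)."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    diffs = [(b - a) ** 2 for a, b in zip(y, y[1:])]
    return (sum(diffs) / len(diffs)) / var


t = [i * 0.01 for i in range(2000)]
slow = [math.sin(v) for v in t]                      # low-frequency factor
fast = [math.sin(25.0 * v) for v in t]               # high-frequency factor
mixed = [a + 0.5 * b for a, b in zip(slow, fast)]    # observation containing both

scores = {name: slowness(sig) for name, sig in
          [("slow", slow), ("mixed", mixed), ("fast", fast)]}
```

Under this score the low-frequency sine ranks as slowest and the high-frequency sine as fastest, with the mixture in between; an SFA-style objective would favour features resembling the first.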

The analysis

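Conceptually, writing $z_{\text{ctx}} = f_\theta(x_{\text{ctx}})$ and $z_{\text{tgt}} = \bar f_\theta(x_{\text{tgt}})$ for the context and target embeddings, minimising the prediction loss

$$\min_{\theta,\phi}\; \mathbb{E}\big[\,\lVert g_\phi(z_{\text{ctx}}) - z_{\text{tgt}} \rVert^2\,\big]$$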

penalises components of $z$ whose value at the target is poorly determined by the context. Decomposing the signal by how predictable (how slow) each factor is, the irreducible prediction error is dominated by fast, high-frequency directions. The objective thus assigns representational capacity in proportion to predictability, recovering the SFA preference for slowly varying factors as an emergent, rather than imposed, property of latent prediction.
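
As a worked illustration of this decomposition, consider a toy linear setting. It is an assumption made here for clarity, not the paper's formal model: the latent has components $z_i = w_i s_i$, the factors $s_i$ are zero-mean, unit-variance and mutually uncorrelated (both within and across the context-target gap), the predictor $g_\phi$ is linear, and $\rho_i = \mathbb{E}\big[s_i^{\text{ctx}} s_i^{\text{tgt}}\big]$ is the correlation of factor $i$ across the gap. The best linear prediction of $s_i^{\text{tgt}}$ from $s_i^{\text{ctx}}$ is $\rho_i s_i^{\text{ctx}}$, with error

$$\mathbb{E}\big[(s_i^{\text{tgt}} - \rho_i s_i^{\text{ctx}})^2\big] = 1 - 2\rho_i^2 + \rho_i^2 = 1 - \rho_i^2,$$

so the loss remaining after the predictor has been optimised is

$$\min_{\phi}\; \mathbb{E}\big[\,\lVert g_\phi(z_{\text{ctx}}) - z_{\text{tgt}} \rVert^2\,\big] = \sum_i w_i^2 \big(1 - \rho_i^2\big).$$

A slow factor ($\rho_i \approx 1$) contributes almost nothing, whereas a fast, unpredictable factor ($\rho_i \approx 0$) contributes its full weight $w_i^2$. If collapse avoidance is modelled as a fixed budget $\sum_i w_i^2 = c$ (again an assumption of this sketch), the loss is smallest when the weight sits on the factors with the largest $\rho_i^2$, that is, on the most predictable factors, which for smoothly varying signals are the slow ones.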

Why it matters

The slow-feature view explains why predicting in representation space is more robust than predicting pixels: pixel reconstruction is forced to model fast nuisance detail, whereas a JEPA naturally discards it. For world modelling this is foundational: a useful world model should track persistent, controllable, semantically meaningful state rather than ephemeral noise. The result situates JEPA within the long lineage of predictive and slowness-based self-supervision and motivates using JEPA latents as the state space for dynamics and planning.

Strengths & limitations

+ Gives an intuitive, testable account of JEPA's inductive bias.
+ Connects JEPA to the established SFA and temporal-coherence literature.
+ Explains the empirical robustness of latent over pixel prediction.
− "Slowness" is desirable only when slow factors are the task-relevant ones; genuinely fast but important dynamics could be under-represented.
− The argument is largely conceptual/empirical rather than a tight general proof.

Connections & references

Builds on: SSL Dynamics, VICReg

Related: Avoiding Noise, I-JEPA, V-JEPA
