Motivation
The tutorial addresses a recurring confusion in self-supervised and generative learning: how to model data whose explanation requires factors we never observe, and how to do so without insisting on a normalized probability distribution that is intractable to compute over high-dimensional signals. Many methods quietly assume a tractable likelihood; the tutorial steps back and offers a single, principled lens — the energy-based model — that subsumes and clarifies the autonomous-intelligence proposal underpinning JEPA. The aim is pedagogical: to make precise why latent prediction in representation space is a sensible objective and how it relates to more familiar contrastive and generative recipes.
How it works
An EBM assigns a scalar energy $E(x,y)$ to a configuration of variables: compatible pairs get low energy, incompatible ones get high energy. Prediction is no longer a forward pass but inference-as-optimization,
$$\check{y} = \arg\min_y E(x,y).$$
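As a minimal sketch of inference-as-optimization, the snippet below approximates $\arg\min_y E(x,y)$ by gradient descent on $y$. The quadratic `energy`, the step size, and the iteration count are illustrative assumptions, not anything specified by the tutorial.

```python
def energy(x, y):
    """Toy energy (assumed for illustration): low when y is close to 2*x."""
    return (y - 2.0 * x) ** 2


def infer(x, y0=0.0, lr=0.1, steps=200, eps=1e-5):
    """Approximate argmin_y E(x, y) by gradient descent on y.

    The gradient is estimated with a central finite difference, so the
    routine only needs to evaluate the energy, not differentiate it.
    """
    y = y0
    for _ in range(steps):
        grad = (energy(x, y + eps) - energy(x, y - eps)) / (2.0 * eps)
        y -= lr * grad
    return y


y_hat = infer(1.5)  # converges towards 2 * 1.5 = 3.0 for this toy energy
```

In a real model the energy would be a neural network and the gradient would come from automatic differentiation, but the loop has the same shape: the input $x$ is held fixed and $y$ is the variable being optimized.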
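Latent-variable EBMs introduce a latent $z$ that captures unobserved explanatory factors. The free energy is obtained by marginalizing over $z$ at inverse temperature $\beta$:
$$F(x,y) = -\frac{1}{\beta}\log \int_z e^{-\beta E(x,y,z)}\,dz.$$
In the limit $\beta \to \infty$ the marginalization reduces to a minimization over the latent, $F(x,y) = \min_z E(x,y,z)$.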
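The free energy can be illustrated numerically under one simplifying assumption: the latent $z$ takes finitely many values, so the integral becomes a sum. The list `energies_over_z` below is a made-up set of values of $E(x,y,z)$ for one fixed pair $(x,y)$.

```python
import math


def free_energy(energies, beta):
    """F = -(1/beta) * log(sum_z exp(-beta * E_z)) for a discrete latent.

    Subtracting the minimum energy before exponentiating keeps the
    computation numerically stable for large beta.
    """
    e_min = min(energies)
    total = sum(math.exp(-beta * (e - e_min)) for e in energies)
    return e_min - math.log(total) / beta


# Hypothetical values of E(x, y, z) for three settings of z.
energies_over_z = [1.0, 2.0, 4.0]

f_soft = free_energy(energies_over_z, beta=1.0)    # about 0.651
f_hard = free_energy(energies_over_z, beta=100.0)  # very close to min = 1.0
```

With a large $\beta$ the result approaches $\min_z E(x,y,z)$, the "minimizing over $z$" case. With a small $\beta$ every latent value contributes, and for this discrete sum the free energy falls below the minimum energy.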
The tutorial dissects the central difficulty — shaping the energy so that low energy concentrates on the data manifold — and contrasts two families of solutions: contrastive methods that explicitly push energy up on negatives, and regularized/architectural methods that instead limit the volume of low-energy space.
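To make the contrastive family concrete, here is a sketch of one common margin-based loss. The hinge form and the `margin` value are assumptions chosen for illustration; the tutorial's taxonomy covers many other contrastive losses.

```python
def hinge_contrastive_loss(e_pos, e_neg, margin=1.0):
    """Margin-based contrastive loss on a pair of energies.

    e_pos: energy of a compatible (observed) pair, to be pushed down.
    e_neg: energy of an incompatible (negative) pair, to be pushed up.
    The loss is zero once e_neg exceeds e_pos by at least `margin`.
    """
    return max(0.0, margin + e_pos - e_neg)


# Negative is barely above the positive: the loss is still active.
loss_active = hinge_contrastive_loss(e_pos=0.2, e_neg=0.3)

# Negative is already far above the positive: nothing left to push.
loss_satisfied = hinge_contrastive_loss(e_pos=0.2, e_neg=1.5)
```

The loss only shapes the energy at the points where negatives are sampled, which is why such methods can need many negatives in high dimensions. Regularized and architectural methods avoid sampling negatives by constraining the low-energy region directly.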
The setup
The unifying claim is that a JEPA is precisely a latent-variable EBM in representation space. The latent variable $z$ absorbs the part of the target that is not predictable from the context, so the predictor can commit to a single prediction for each value of $z$ even when the context admits several plausible targets. The energy is the discrepancy between predicted and observed representations, measured in representation space rather than pixel space. From this viewpoint, the diverse anti-collapse tricks of the SSL literature (stop-gradients, EMA teachers, variance/covariance regularizers) are all mechanisms for shaping the energy landscape so that it cannot collapse to a constant solution with low energy everywhere.
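The two ideas in this paragraph can be sketched in a few lines: an energy measured between representations with a latent that absorbs unpredictable variation, and a regularizer that penalizes collapsed embeddings. Everything named here is hypothetical: the additive `predictor`, the finite candidate set for $z$, and the VICReg-style variance hinge are illustrative choices, not the tutorial's specification.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two representation vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def predictor(s_x, z):
    """Hypothetical predictor: shifts the context representation by z."""
    return [si + zi for si, zi in zip(s_x, z)]


def jepa_energy(s_x, s_y, z_candidates):
    """Energy in representation space, minimized over a finite set of latents."""
    return min(sq_dist(predictor(s_x, z), s_y) for z in z_candidates)


def variance_penalty(batch, target_std=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation of a batch of embeddings.

    The penalty is large when all embeddings are (nearly) identical,
    i.e. when the encoder has collapsed to a constant output.
    """
    n, dim = len(batch), len(batch[0])
    penalty = 0.0
    for d in range(dim):
        col = [row[d] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        penalty += max(0.0, target_std - math.sqrt(var + eps))
    return penalty / dim


s_x, s_y = [0.0, 1.0], [1.0, 1.0]
e_without_latent = jepa_energy(s_x, s_y, [[0.0, 0.0]])
e_with_latent = jepa_energy(s_x, s_y, [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])

collapsed = [[0.5, 0.5]] * 4
spread = [[-1.5, 1.5], [-0.5, 0.5], [0.5, -0.5], [1.5, -1.5]]
p_collapsed = variance_penalty(collapsed)
p_spread = variance_penalty(spread)
```

With the latent available, the energy for this pair drops from 1.0 to 0.0 because one candidate $z$ accounts for the difference between the two representations. If the candidate set were unrestricted, the energy could be driven to zero for any target, so the latent's capacity has to be limited. The variance penalty is close to its maximum for the collapsed batch and zero for the spread one.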
Key results & what's novel
As a tutorial, its contribution is conceptual unification rather than new benchmarks. It clarifies that prediction, generation, and self-supervised representation learning are all special cases of energy shaping, and it gives a clean taxonomy separating contrastive from architectural/regularized approaches. Its most consequential idea for world modeling is that planning reduces to energy minimization over action and latent variables: choosing actions is just searching for the configuration of lowest predicted energy. By recasting JEPA as a latent-variable EBM, it turns the family's empirical recipes into instances of a single, well-motivated design principle.
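The claim that planning reduces to energy minimization can be sketched with a toy problem. The one-dimensional `world_model`, the goal-distance energy, the action cost, and the exhaustive search are all assumptions made for brevity; the latent variable is omitted, and a realistic planner would use gradient-based or sampling-based search over a learned model.

```python
from itertools import product


def world_model(state, action):
    """Hypothetical deterministic dynamics in a 1-D representation space."""
    return state + action


def plan_energy(state, actions, goal, action_cost=0.01):
    """Energy of an action sequence: distance to goal plus a small effort cost."""
    for a in actions:
        state = world_model(state, a)
    return (state - goal) ** 2 + action_cost * sum(a * a for a in actions)


def plan(state, goal, horizon=3, action_set=(-1.0, 0.0, 1.0)):
    """Return the action sequence with the lowest predicted energy."""
    candidates = product(action_set, repeat=horizon)
    return min(candidates, key=lambda acts: plan_energy(state, acts, goal))


best = plan(state=0.0, goal=2.0)
best_energy = plan_energy(0.0, best, 2.0)
```

For this setup the lowest-energy plans reach the goal exactly using two unit steps and one idle step. Several orderings tie, and `min` returns the first one it encounters.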
Strengths & limitations
- + A clear, unifying framework that connects SSL, generative modeling, and planning.
- + Makes the role of the latent variable in handling uncertainty explicit and intuitive.
- + Gives a principled account of why non-contrastive methods can avoid collapse.
- − Conceptual rather than empirical; provides no new model or benchmark.
- − The hard practical question — how exactly to shape the energy for a given domain — is surveyed but not solved.
- − Inference-as-optimization is elegant but can be expensive, and marginalizing over $z$ is intractable in general.