At a glance
Problem: Reproducing and experimenting with JEPAs usually demands multi-GPU clusters and fragmented, task-specific codebases.
Key idea: A lightweight, pedagogical library that frames JEPA as an energy-based model and runs on a single GPU across image, video and planning.
Modality: Images, video, action-conditioned control
Target / masking: Latent prediction with block masking (images), future-frame masking (video), action conditioning (planning); EMA target encoder.
Builds on: I-JEPA, V-JEPA and the energy-based view of joint-embedding models.
Used for: Teaching, rapid prototyping and unifying perception, dynamics and control under one objective.

Motivation

Interest in JEPAs has grown quickly, but most reference implementations were built for industrial-scale training: large multi-GPU clusters, sprawling configs, and separate codebases for images, video and planning. This raises the barrier to entry for students and small labs who want to understand the method rather than scale it. EB-JEPA (Terver et al., 2026) targets this accessibility gap with a compact, pedagogical library that runs on a single GPU and exposes the conceptual through-line from self-supervised representation learning to predictive world models in one coherent interface, accompanied by a tutorial.

What's inside

EB-JEPA frames JEPA explicitly as an energy-based model: a scalar energy $E_\theta(x,y)$ measures the compatibility of a context $x$ and a (masked or future) target $y$ in latent space, with low energy assigned to observed pairs. The shared backbone relies on the standard JEPA machinery to avoid representation collapse: asymmetric encoders, an EMA target encoder, and prediction in latent space rather than pixel reconstruction.
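The EMA target encoder mentioned above is usually kept as an exponential moving average of the context encoder's weights, $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. The following is a minimal sketch of that update, not the library's code: the function name `ema_update`, the momentum `tau` and the flat-list parameter layout are all assumptions made for illustration.

```python
def ema_update(target_params, online_params, tau=0.99):
    """Return target parameters moved one EMA step toward the online ones.

    Hypothetical helper: parameters are plain lists of floats here, whereas a
    real encoder would hold tensors. `tau` close to 1 means a slow-moving target.
    """
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


# Toy usage: the target drifts toward the online (context) encoder weights.
target = [0.0, 1.0]
online = [1.0, 1.0]
target = ema_update(target, online, tau=0.9)
```

Because the target weights only change through this average and never through backpropagation, the target branch supplies slowly varying regression targets for the predictor.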

The library spans three canonical regimes:

  • Image representation learning in the I-JEPA style, with block masking and latent prediction.
  • Video prediction, forecasting future latent states from past context frames.
  • Action-conditioned planning, where the predictor is conditioned on actions, turning the JEPA into a latent world model for model-based control.
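To make the planning regime concrete, here is a rough sketch of how an action-conditioned predictor could be used as a latent world model. It uses random-shooting planning, which is a common baseline chosen here for brevity and not necessarily the planner EB-JEPA ships. Every name (`plan_random_shooting`, `toy_predictor`, the action range) is hypothetical, and the toy predictor stands in for a learned one.

```python
import random


def squared_distance(u, v):
    """Energy between two latent vectors: squared Euclidean distance."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def rollout(z0, actions, predictor):
    """Apply the action-conditioned predictor step by step in latent space."""
    z = z0
    for a in actions:
        z = predictor(z, a)
    return z


def plan_random_shooting(z0, z_goal, predictor, horizon, num_candidates, rng):
    """Sample action sequences; keep the one whose final latent is closest to the goal."""
    best_actions, best_energy = None, float("inf")
    for _ in range(num_candidates):
        # Assumption: actions live in [-1, 1] and have the same size as the latent.
        actions = [[rng.uniform(-1.0, 1.0) for _ in z0] for _ in range(horizon)]
        energy = squared_distance(rollout(z0, actions, predictor), z_goal)
        if energy < best_energy:
            best_actions, best_energy = actions, energy
    return best_actions, best_energy


def toy_predictor(z, a):
    """Hypothetical stand-in for a learned predictor: the action shifts the latent."""
    return [zi + ai for zi, ai in zip(z, a)]


rng = random.Random(0)
z_start, z_goal = [0.0, 0.0], [1.0, -1.0]
best_actions, best_energy = plan_random_shooting(
    z_start, z_goal, toy_predictor, horizon=3, num_candidates=500, rng=rng
)
```

Planning never decodes back to pixels: candidate action sequences are scored purely by the energy between the predicted latent and the goal latent.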

The objective

Across regimes, training minimises a latent prediction energy between the predictor output and the EMA target embedding:

$$\mathcal{L} = \big\lVert g_\phi\big(f_\theta(x),\,a\big) - \operatorname{sg}\big[f_{\bar\theta}(y)\big] \big\rVert^2$$

Here $f_\theta$ is the context encoder, $f_{\bar\theta}$ the EMA target encoder, $\operatorname{sg}[\cdot]$ the stop-gradient operator (no gradients flow into the target branch), $g_\phi$ the predictor, and $a$ an optional action that conditions the prediction. Setting $a$ to a mask token recovers I-JEPA; conditioning on future-frame indices recovers video prediction; conditioning on real actions yields a controllable world model. The same energy underlies all three.
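The loss above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not EB-JEPA's interface: `jepa_energy` and the three toy callables are hypothetical, latents are plain lists, and because there is no autograd here the stop-gradient is only indicated by a comment (in an autograd framework the target embedding would be detached).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two latent vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def jepa_energy(x, y, a, context_encoder, target_encoder, predictor):
    """Latent prediction energy || g(f(x), a) - sg[f_bar(y)] ||^2 for one pair."""
    z_context = context_encoder(x)
    # Stop-gradient: the target embedding is treated as a constant regression target.
    z_target = target_encoder(y)
    z_pred = predictor(z_context, a)
    return squared_distance(z_pred, z_target)


# Hypothetical toy components: identity encoders and an additive predictor.
def toy_encoder(v):
    return list(v)


def toy_predictor(z, a):
    return [zi + ai for zi, ai in zip(z, a)]


x = [1.0, 2.0]
a = [0.5, -0.5]
compatible = jepa_energy(x, [1.5, 1.5], a, toy_encoder, toy_encoder, toy_predictor)
incompatible = jepa_energy(x, [0.0, 0.0], a, toy_encoder, toy_encoder, toy_predictor)
```

In this toy, the pair whose target matches the prediction gets the lower energy, which is the behaviour training is meant to produce for observed pairs. Only the meaning of `a` would change between the image, video and planning regimes.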

Why it matters

EB-JEPA's contribution is practical and unifying rather than a new state-of-the-art number. By making the energy-based predictive objective runnable on commodity hardware and showing that the same formulation supports perception, dynamics and control, it lets newcomers see how representation learning, video forecasting and planning are instances of one idea. The tutorial and single-file interface make the JEPA-to-world-model path concrete, lowering the cost of prototyping latent world models and of teaching the energy-based framing that motivates the broader JEPA program.

Strengths & limitations

  • + Single-GPU, low-overhead, easy to read and modify.
  • + Unifies image, video and action-conditioned settings under one energy-based objective.
  • + Pedagogical: paired tutorial makes the JEPA-to-world-model link explicit.
  • − Optimised for clarity and small scale, not for matching large-cluster results.
  • − Retains the usual EMA/stop-gradient machinery rather than exploring heuristics-free alternatives.

Connections & references

Builds on: I-JEPA, V-JEPA