Overview

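[Figure: JEPA schematic. Input (tokens · masked blocks) → context encoder $f_\theta$ → $z_{\text{ctx}}$ → predictor $g_\phi$ → $\hat z$; target encoder $\bar f_\theta$ (EMA copy) → $\bar z$ (stop-gradient, sg); latent loss $\|\hat z - \mathrm{sg}(\bar z)\|^2$.]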
Canonical JEPA schematic for Theory. The input is split into a visible context and hidden targets (token-level, masked blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding $z_{\text{ctx}}$ to a prediction $\hat z$ of the target embedding $\bar z$; training minimises the latent distance $\|\hat z - \mathrm{sg}(\bar z)\|^2$.

This position paper confronts a foundational problem: how machines can learn world models, reason, and plan as efficiently as humans and animals, rather than relying on the brittle reactivity of pure pattern matching or the sample inefficiency of reinforcement learning. LeCun argues that intelligent behavior should emerge from a single differentiable architecture organized into cooperating modules: a configurator that conditions the others for the task at hand, a perception module that estimates the world state, a world model that predicts future states and fills in missing information, a cost module that measures the agent's discomfort as a scalar energy, an actor that optimizes action sequences, and a short-term memory.
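To make the division of labour between the modules concrete, here is a minimal sketch of a perceive, plan, act loop on a one-dimensional toy world. Everything in it is an assumption made for illustration: the function names, the additive dynamics, the quadratic cost, and the use of random-shooting search in place of the gradient-based action optimization the differentiable architecture allows. The configurator is omitted, and the "environment" is simply the world model itself.

```python
import random


def perceive(observation):
    """Perception module: estimate the world state (identity on a 1-D toy state)."""
    return float(observation)


def world_model(state, action):
    """World model: predict the next state (hypothetical additive dynamics)."""
    return state + action


def cost(state, goal=0.0):
    """Cost module: scalar energy, lower is better (squared distance to a goal)."""
    return (state - goal) ** 2


def plan(state, rng, horizon=5, n_candidates=256):
    """Actor: random-shooting search for a low-cost action sequence."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s = world_model(s, a)   # imagined rollout, no real interaction
            total += cost(s)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost


def run_episode(observation, steps=10, seed=0):
    """Receding-horizon loop: plan, execute the first action, perceive again."""
    rng = random.Random(seed)
    memory = []                      # short-term memory of (state, action, cost)
    state = perceive(observation)
    for _ in range(steps):
        seq, _ = plan(state, rng)
        action = seq[0]
        # Toy assumption: the real environment behaves exactly like the model.
        state = perceive(world_model(state, action))
        memory.append((state, action, cost(state)))
    return state, memory


final_state, memory = run_episode(observation=3.0)
print(f"final state: {final_state:.3f}, steps recorded: {len(memory)}")
```

Only the first action of each planned sequence is executed before replanning, which is the usual model-predictive control pattern; a learned world model and cost would replace the hand-written functions.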

Mechanically, the proposal is framed in the language of energy-based models: rather than predicting pixels, the system learns by minimizing a scalar energy that measures the incompatibility between observations and predictions. The key contribution is the Joint Embedding Predictive Architecture (JEPA), which predicts the representation of a target from the representation of a context in an abstract latent space, sidestepping the impossibility of predicting all low-level details of a high-dimensional, uncertain world. Multiple JEPAs can be stacked hierarchically to support prediction and planning at increasing levels of abstraction.
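The latent-prediction idea can be sketched in a few lines of NumPy. The block below is a toy under stated assumptions, not the paper's implementation: the encoders and predictor are plain linear maps, the dimensions, learning rate `lr`, and EMA momentum `tau` are hypothetical, gradients are written out by hand, and the stop-gradient plus EMA target follows the schematic above. It takes no further measures against representation collapse.

```python
import numpy as np

D_IN, D_LAT = 8, 4  # hypothetical input and latent sizes


def init_params(rng):
    """Linear stand-ins for the context encoder f_theta and predictor g_phi."""
    return {
        "W_enc": rng.normal(scale=0.1, size=(D_LAT, D_IN)),
        "W_pred": rng.normal(scale=0.1, size=(D_LAT, D_LAT)),
    }


def jepa_loss_and_grads(params, W_tgt, x_ctx, x_tgt):
    """Energy ||z_hat - sg(z_bar)||^2 averaged over the batch, with manual gradients."""
    z_ctx = x_ctx @ params["W_enc"].T       # context embedding
    z_hat = z_ctx @ params["W_pred"].T      # predicted target embedding
    z_bar = x_tgt @ W_tgt.T                 # target embedding, treated as a constant (sg)
    diff = z_hat - z_bar
    batch = x_ctx.shape[0]
    loss = np.sum(diff ** 2) / batch
    g_zhat = 2.0 * diff / batch
    grads = {
        "W_pred": g_zhat.T @ z_ctx,
        "W_enc": (g_zhat @ params["W_pred"]).T @ x_ctx,
    }
    return loss, grads                      # no gradient is returned for W_tgt


def train_step(params, W_tgt, x_ctx, x_tgt, lr=0.05, tau=0.99):
    """One gradient step on (f_theta, g_phi), then an EMA update of the target encoder."""
    loss, grads = jepa_loss_and_grads(params, W_tgt, x_ctx, x_tgt)
    new_params = {k: params[k] - lr * grads[k] for k in params}
    new_W_tgt = tau * W_tgt + (1.0 - tau) * new_params["W_enc"]
    return new_params, new_W_tgt, loss


rng = np.random.default_rng(0)
params = init_params(rng)
W_tgt = params["W_enc"].copy()              # target encoder starts as a copy
x = rng.normal(size=(32, D_IN))
visible = np.arange(D_IN) < D_IN // 2       # one masked block: second half is hidden
x_ctx, x_tgt = x * visible, x * ~visible

for step in range(3):
    params, W_tgt, loss = train_step(params, W_tgt, x_ctx, x_tgt)
    print(f"step {step}: latent loss = {loss:.5f}")
```

The loss is computed entirely between embeddings, so nothing in the update asks the model to reconstruct the hidden inputs themselves.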

For world modeling, the paper supplies the conceptual scaffold that the entire JEPA family later instantiates: prediction in latent space rather than generation of raw inputs, non-contrastive self-supervised learning, and model-predictive planning. It motivates avoiding representation collapse and treating uncertainty in latent space, ideas that transfer naturally to scientific domains where the relevant abstractions, not raw signals, drive understanding.
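As an illustration of what a non-contrastive guard against collapse can look like, the sketch below computes a VICReg-style variance and covariance penalty on a batch of embeddings. This particular regularizer and the values of `gamma` and `eps` are assumptions chosen for the example, not something the summary above specifies. The variance term penalises embedding dimensions whose standard deviation falls below `gamma`, and the covariance term penalises correlation between dimensions.

```python
import numpy as np


def variance_covariance_penalty(z, gamma=1.0, eps=1e-4):
    """VICReg-style penalty for embeddings z of shape (batch, dim)."""
    z = z - z.mean(axis=0, keepdims=True)
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    # Hinge: zero once every dimension has standard deviation >= gamma.
    var_term = np.mean(np.maximum(0.0, gamma - std))
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    # Sum of squared off-diagonal covariances, scaled by the embedding size.
    cov_term = np.sum(off_diag ** 2) / z.shape[1]
    return var_term, cov_term


rng = np.random.default_rng(0)
collapsed = np.ones((32, 4))                 # every input maps to the same point
spread = rng.normal(size=(1000, 4))          # roughly unit variance, decorrelated

print("collapsed:", variance_covariance_penalty(collapsed))
print("spread:   ", variance_covariance_penalty(spread))
```

A collapsed encoder makes the latent prediction loss trivially small, so a penalty of this kind is added to the training objective to make the constant solution expensive.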