V-JEPA — World Modeling

Overview

Canonical JEPA schematic for video. The input is split into a visible context and hidden targets (masked blocks of tokens). The context encoder $f_\theta$ embeds the visible tokens; the target encoder $\bar f_\theta$ (an EMA copy of the context encoder, with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embeddings to predicted target embeddings; training minimises the distance between predicted and target embeddings in latent space.
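
As a compact restatement of the schematic, the objective can be written as below. This is a sketch under assumed notation rather than a formula taken from the source: $x$ is the input, $x_C$ its visible context tokens, $M$ the set of masked token indices, $d$ the chosen latent distance, $\mathrm{sg}$ the stop-gradient operator, $\bar\theta$ the weights of $\bar f_\theta$, and $\tau \in [0, 1)$ the EMA momentum.

$$
\mathcal{L}(\theta, \phi) = \frac{1}{|M|} \sum_{i \in M} d\Big( g_\phi\big(f_\theta(x_C)\big)_i,\ \mathrm{sg}\big(\bar f_\theta(x)_i\big) \Big),
\qquad
\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta
$$

Gradients flow only through $f_\theta$ and $g_\phi$; the target weights $\bar\theta$ change only through the moving-average update on the right.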

V-JEPA targets unsupervised learning of general-purpose video representations. Pixel-level video prediction squanders capacity on irreducibly unpredictable detail (textures, exact pixel dynamics), and reliance on pretrained image encoders or hand-crafted augmentations limits generality. The motivation is to revisit feature prediction as a standalone objective and show it suffices for strong visual understanding from video alone.

Mechanically, the video is tokenized into spatiotemporal patches. A context encoder (a ViT that processes only the visible tokens) embeds the unmasked subset of the clip. A predictor, conditioned on the positions of the masked tokens, predicts the representations of those masked spatiotemporal tokens. The prediction targets come from a target encoder whose weights are an exponential moving average (EMA) of the context encoder's weights; it is applied to the full clip, and a stop-gradient on the target branch keeps the loss from updating it directly. The masking scheme uses large, multi-block spatiotemporal masks (including extended temporal extents), so the model must infer motion and object continuity rather than copy nearby frames. The objective is an $L_1$/$L_2$ latent prediction loss between the predicted features and the EMA-target features. The asymmetry between the two branches (EMA target plus stop-gradient) is what prevents representational collapse. Notably, the method uses no pixel decoder, no negative samples, and no pretrained image encoder.
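
The three ingredients described above (block masking over a token grid, the EMA target update, and a latent regression loss) can be illustrated with a minimal standard-library sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names, the grid and block sizes, the momentum value, the choice of $L_1$ as the loss, and the choice to extend each spatial block across every time step. Encoders and the predictor are omitted; features are plain lists of floats.

```python
import random


def multiblock_mask(grid_t, grid_h, grid_w, n_blocks, block_h, block_w, rng):
    """Return the set of masked (t, h, w) token indices.

    Assumption: each block is a spatial rectangle repeated at every time
    step, so a masked region cannot be filled in by copying a nearby frame.
    Blocks may overlap, so the masked set can be smaller than the sum of
    the block sizes.
    """
    masked = set()
    for _ in range(n_blocks):
        top = rng.randrange(grid_h - block_h + 1)
        left = rng.randrange(grid_w - block_w + 1)
        for t in range(grid_t):
            for h in range(top, top + block_h):
                for w in range(left, left + block_w):
                    masked.add((t, h, w))
    return masked


def ema_update(target_weights, context_weights, tau=0.99):
    """EMA step: target <- tau * target + (1 - tau) * context.

    No gradient is involved; the target encoder only ever changes here.
    """
    return [tau * t + (1.0 - tau) * c
            for t, c in zip(target_weights, context_weights)]


def l1_latent_loss(predicted, target):
    """Mean absolute difference between predicted and target feature vectors."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += abs(p - t)
            count += 1
    return total / count


# Hypothetical 4 x 6 x 6 token grid with two 3 x 3 spatial blocks masked.
rng = random.Random(0)
masked = multiblock_mask(4, 6, 6, n_blocks=2, block_h=3, block_w=3, rng=rng)
context = [(t, h, w)
           for t in range(4) for h in range(6) for w in range(6)
           if (t, h, w) not in masked]
```

In a full training loop the context encoder would see only the tokens in `context`, the predictor would output one feature vector per index in `masked`, `l1_latent_loss` would compare those predictions with the target encoder's features at the same indices, and `ema_update` would run after each optimiser step.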

The key contribution is showing that feature prediction from video produces versatile representations that transfer to both motion-centric and appearance-centric tasks, often under frozen evaluation, where the pretrained backbone is not fine-tuned. For world modeling, V-JEPA is the temporal extension of the JEPA principle: by predicting in latent space, it learns intuitive-physics-like structure (object permanence, dynamics) directly from passive observation, a paradigm equally suited to time-resolved scientific video such as live-cell imaging.