At a glance
Problem: EEG modelling suffers from scarce labels and low signal-to-noise, while the video-JEPA framework offers a mature, augmentation-free latent-prediction paradigm developed on a very different modality.
Key idea: Transfer the JEPA recipe proven on video to EEG, re-engineering tokenisation and spatiotemporal masking for multichannel brain signals.
Modality: EEG
Target / masking: EEG spatiotemporal segments; a target encoder supplies stop-gradient latent targets.
Builds on: V-JEPA's video latent-prediction recipe.
Used for: EEG brain-signal representation learning via cross-modal recipe transfer.

Motivation

EEG modelling is constrained by scarce labels and low signal-to-noise. Meanwhile the video-JEPA framework is a mature, augmentation-free latent-prediction paradigm proven on a data-rich modality. Hojjati et al. (2025) ask a methodological question: can that recipe, developed for video, be transferred to a fundamentally different modality — multichannel brain signals — and still produce useful representations? The motivation is as much about cross-modal recipe reuse as about EEG itself.

How it works

Figure: Canonical JEPA schematic for EEG. The input is split into a visible context and hidden targets (masked at the level of channel-time patches, in spatiotemporal segments). The context encoder $f_\theta$ embeds what is visible, giving $z_{\text{ctx}}$; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets, giving $\bar z$; the predictor $g_\phi$ maps the context embedding to predicted target embeddings $\hat z$; training minimises the latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.

The approach reuses the JEPA structure wholesale and re-engineers only what the modality demands.

  • A context encoder embeds visible signal patches.
  • A predictor predicts the latent representations of masked regions.
  • A target encoder supplies stop-gradient latent targets.

The input tokenisation and spatiotemporal masking are re-engineered for multichannel EEG rather than video frames: the masking unit becomes EEG spatiotemporal segments. The model predicts the latent representation of masked brain-signal regions, treating EEG as a structured spatiotemporal signal analogous to video — the channel axis playing a role like spatial extent, time like the frame axis.
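The tokenisation and masking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `patchify` and `block_mask`, the patch length, the block sizes, and the synthetic data are all hypothetical. The sketch also assumes that the channel ordering reflects spatial adjacency, so that a contiguous range of channel indices stands in for a spatial region.

```python
import numpy as np

def patchify(eeg, patch_len):
    """Split a (channels, time) array into (channels, n_patches, patch_len) tokens."""
    n_ch, n_t = eeg.shape
    n_patches = n_t // patch_len  # drop any trailing samples that do not fill a patch
    return eeg[:, : n_patches * patch_len].reshape(n_ch, n_patches, patch_len)

def block_mask(n_ch, n_patches, ch_span, t_span, rng):
    """Boolean (channels, n_patches) mask; True marks a hidden target token.

    One contiguous block over channels and time is hidden, the EEG analogue
    of a spatiotemporal block in video (an assumption of this sketch).
    """
    c0 = rng.integers(0, n_ch - ch_span + 1)
    t0 = rng.integers(0, n_patches - t_span + 1)
    mask = np.zeros((n_ch, n_patches), dtype=bool)
    mask[c0:c0 + ch_span, t0:t0 + t_span] = True
    return mask

rng = np.random.default_rng(0)
eeg = rng.standard_normal((8, 1000))       # synthetic recording: 8 channels, 1000 samples
tokens = patchify(eeg, patch_len=100)      # shape (8, 10, 100)
mask = block_mask(8, 10, ch_span=3, t_span=4, rng=rng)

context_tokens = tokens[~mask]             # visible tokens, shape (68, 100)
target_tokens = tokens[mask]               # hidden tokens, shape (12, 100)
```

Here the context encoder would see only `context_tokens`, while the target encoder's embeddings at the masked positions serve as prediction targets.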

The objective

The loss mirrors video-JEPA, applied to masked EEG spatiotemporal segments:

$$\mathcal{L} = \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\bar f_\theta(x)_k]\,\big\rVert_2^2,$$

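with predictor $g_\phi$, stop-gradient $\operatorname{sg}$, and target encoder $\bar f_\theta$. Here $z_{\text{ctx}}$ is the context encoder's embedding of the visible input, $\bar f_\theta(x)_k$ is the target embedding at masked segment $k$, and $m_k$ indicates the position of that segment (a mask token, as in video-JEPA). The objective is unchanged from video; only the data tokenisation and masking are adapted, which is precisely the point being tested.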
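As a rough numerical illustration of this objective, the sketch below evaluates the loss for toy linear encoders and applies an EMA update to the target weights. Everything here is an assumption made for illustration: the linear maps `W_ctx`, `W_tgt`, `W_pred`, the mean-pooled predictor, the mask tokens `m`, the dimensions, and the momentum value are hypothetical and do not come from the paper. NumPy has no autograd, so the stop-gradient is only indicated in a comment; in a real framework the target branch would be excluded from backpropagation.

```python
import numpy as np

def jepa_loss(pred, target):
    """Sum over masked tokens of the squared L2 distance; inputs are (n_masked, dim)."""
    return float(np.sum((pred - target) ** 2))

def ema_update(target_params, online_params, momentum):
    """Move each target-encoder weight towards the context-encoder weight."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

rng = np.random.default_rng(0)
dim_in, dim = 100, 16
W_ctx = rng.standard_normal((dim_in, dim)) * 0.1   # toy context encoder f_theta
W_tgt = W_ctx.copy()                               # target encoder starts as a copy
W_pred = rng.standard_normal((dim, dim)) * 0.1     # toy predictor g_phi

x_ctx = rng.standard_normal((68, dim_in))          # visible tokens (synthetic)
x_tgt = rng.standard_normal((12, dim_in))          # masked tokens (synthetic)
m = rng.standard_normal((12, dim)) * 0.1           # hypothetical mask tokens m_k

z_ctx = x_ctx @ W_ctx                              # context embeddings, (68, 16)
z_bar = x_tgt @ W_tgt                              # targets; sg: treated as constants
z_hat = (z_ctx.mean(axis=0) + m) @ W_pred          # one prediction per masked token
loss = jepa_loss(z_hat, z_bar)

# Pretend an optimiser step changed the context encoder, then track it by EMA.
W_ctx_new = W_ctx + 0.01
(W_tgt_new,) = ema_update([W_tgt], [W_ctx_new], momentum=0.99)
```

Only the context encoder and predictor would receive gradients from `loss`; the target encoder changes solely through `ema_update`.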

Key results & what's novel

The key contribution is a demonstration of cross-modal recipe transfer: the video-JEPA machinery, with appropriate adaptation, produces useful EEG representations. This supports the view that the JEPA principle is modality-general: the same architecture and pretraining schedule that work on video can be ported to label-poor biosignals. The strategic lesson is that one can reuse battle-tested architectures and schedules from data-rich modalities rather than designing an EEG-specific method from scratch.

Strengths & limitations

  • + Demonstrates the JEPA recipe is modality-general, transferring from video to EEG.
  • + Lets EEG borrow mature, battle-tested video architectures and schedules.
  • + Augmentation-free latent prediction suited to noisy biosignals.
  • − A transfer demonstration rather than an EEG-bespoke design.
  • − Tokenisation/masking still must be re-engineered for multichannel signals.
  • − Learns a representation, not a dynamics/world model.

Connections & references

Builds on: V-JEPA