Motivation
EEG modelling is constrained by scarce labels and a low signal-to-noise ratio. Meanwhile, the video-JEPA framework is a mature, augmentation-free latent-prediction paradigm proven on a data-rich modality. Hojjati et al. (2025) ask a methodological question: can that recipe, developed for video, be transferred to a fundamentally different modality (multichannel brain signals) and still produce useful representations? The motivation is as much about cross-modal recipe reuse as about EEG itself.
How it works
The approach reuses the JEPA structure wholesale and re-engineers only what the modality demands.
- A context encoder embeds visible signal patches.
- A predictor predicts the latent representations of masked regions.
- A target encoder supplies stop-gradient latent targets.
The input tokenisation and spatiotemporal masking are re-engineered for multichannel EEG rather than video frames: the masking unit becomes a spatiotemporal EEG segment. The model predicts the latent representation of masked brain-signal regions, treating EEG as a structured spatiotemporal signal analogous to video, with the channel axis playing the role of spatial extent and the time axis the role of the frame sequence.
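To make the tokenisation and masking step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `patchify` and `block_mask`, the patch length, the recording size, and the choice of a single contiguous channel-by-time block as the mask are all illustrative assumptions. It also assumes that neighbouring channel indices are meaningful neighbours, which depends on the montage ordering.

```python
import random


def patchify(eeg, patch_len):
    """Split a (channels x time) recording into a grid of tokens.

    eeg: list of channels, each a list of samples of equal length.
    Returns grid[c][t], the patch_len samples of channel c in time window t.
    """
    n_time = len(eeg[0])
    n_windows = n_time // patch_len  # any trailing partial window is dropped
    return [
        [ch[t * patch_len:(t + 1) * patch_len] for t in range(n_windows)]
        for ch in eeg
    ]


def block_mask(n_channels, n_windows, ch_span, t_span, rng):
    """Hide one contiguous channel x time block (an assumed mask shape).

    Returns (masked, visible): two disjoint sets of (channel, window) indices.
    """
    c0 = rng.randrange(n_channels - ch_span + 1)
    t0 = rng.randrange(n_windows - t_span + 1)
    masked = {
        (c, t)
        for c in range(c0, c0 + ch_span)
        for t in range(t0, t0 + t_span)
    }
    everything = {(c, t) for c in range(n_channels) for t in range(n_windows)}
    return masked, everything - masked


rng = random.Random(0)
# Hypothetical recording: 8 channels, 256 samples of Gaussian noise.
eeg = [[rng.gauss(0.0, 1.0) for _ in range(256)] for _ in range(8)]
grid = patchify(eeg, patch_len=32)  # 8 channels x 8 windows of 32 samples
masked, visible = block_mask(8, 8, ch_span=4, t_span=3, rng=rng)
```

In this sketch the context encoder would see only the tokens indexed by `visible`, while the predictor is asked for latents at the positions in `masked`.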
The objective
The loss mirrors video-JEPA, applied to masked EEG spatiotemporal segments:
$$\mathcal{L} = \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\bar f_\theta(x)_k]\,\big\rVert_2^2,$$
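with predictor $g_\phi$, stop-gradient operator $\operatorname{sg}$, and target encoder $\bar f_\theta$; $z_{\text{ctx}}$ denotes the context encoder's output on the visible patches and $m_k$ the mask token marking the position of masked segment $k$. The objective is unchanged from video; only the data tokenisation and masking are adapted, which is precisely the point being tested.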
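The loss above can be sketched numerically as follows. This is a toy illustration in plain Python, not the paper's code: the function name `jepa_loss`, the dictionary layout, and the two-dimensional latents are assumptions made for the example. With no autodiff involved, the stop-gradient is represented simply by treating the targets as constants; in a real framework that role would be played by something like `detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.

```python
def jepa_loss(predicted, target, masked):
    """Sum over masked tokens of the squared L2 distance between latents.

    predicted, target: dicts mapping a token index to a latent vector
    (a list of floats). Targets are treated as fixed constants, which
    stands in for the stop-gradient on the target encoder's output.
    """
    total = 0.0
    for k in masked:
        total += sum((p - t) ** 2 for p, t in zip(predicted[k], target[k]))
    return total


# Two masked tokens with hypothetical 2-dimensional latents.
masked = [(0, 1), (2, 3)]
predicted = {(0, 1): [1.0, 2.0], (2, 3): [0.0, 0.0]}
target = {(0, 1): [1.0, 0.0], (2, 3): [3.0, 4.0]}

loss = jepa_loss(predicted, target, masked)  # (0 + 4) + (9 + 16) = 29.0
```

Only the masked positions contribute, so the visible tokens influence the loss solely through the predictor's conditioning on them.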
Key results & what's novel
The key contribution is a demonstration of cross-modal recipe transfer: the video-JEPA machinery, with appropriate adaptation, produces useful EEG representations. This supports the view that the JEPA principle is modality-general: the same architecture and pretraining schedule that work on video can be ported to label-poor biosignals. The strategic lesson is that one can reuse battle-tested architectures and schedules from data-rich modalities rather than designing an EEG-specific method from scratch.
Strengths & limitations
- + Demonstrates the JEPA recipe is modality-general, transferring from video to EEG.
- + Lets EEG borrow mature, battle-tested video architectures and schedules.
- + Augmentation-free latent prediction suited to noisy biosignals.
- − A transfer demonstration rather than an EEG-bespoke design.
- − Tokenisation/masking still must be re-engineered for multichannel signals.
- − Learns a representation, not a dynamics/world model.