Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework proposed by Yann LeCun in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." JEPA represents a fundamental departure from both generative models (which predict pixels/tokens) and contrastive learning (which requires negative pairs). Instead, JEPA makes predictions in abstract representation space — it learns to predict the latent embedding of a target signal from the embedding of a context signal, without ever reconstructing the raw input.
The core insight is elegant: by predicting what you should know about the target rather than every pixel detail, the model naturally learns to discard unpredictable noise (textures, lighting, exact pixel values) and retain semantic, structural information. This is precisely what biological perception does — you recognize a chair by its abstract structure, not by memorizing every photon reflected from its surface.
Every JEPA variant shares three core components, regardless of the input modality:

- **Context encoder** fθ — processes the visible (unmasked) portion of the input; trained by gradient descent.
- **Target encoder** f̄θ — an exponential moving average of the context encoder, updated as θ̄ ← τθ̄ + (1-τ)θ. Processes the full, unmasked input. No gradient flows through it.
- **Predictor** gφ — maps the context embeddings, together with positional information about the masked regions, to predicted target embeddings.

| Aspect | Generative (MAE) | Contrastive (SimCLR, DINO) | JEPA (Predictive) |
|---|---|---|---|
| Prediction space | Input (pixels/tokens) | None (alignment) | Latent representations |
| Negative samples | No | Yes (or momentum) | No |
| Augmentations | Masking | Heavy (crop, color, etc.) | Masking only |
| Collapse avoidance | N/A (reconstructive) | Negatives / momentum | Predictor bottleneck + EMA |
| Capacity waste | High (pixel details) | Low | Low (semantic only) |
| Low-shot transfer | Moderate | Good | Excellent |
| Domain flexibility | Good | Needs augmentation design | Excellent (masking is universal) |
The training objective is an L2 distance computed entirely in latent space:

L = (1/|M|) Σᵢ∈M ‖ gφ(fθ(x_ctx))ᵢ − sg(f̄θ(x))ᵢ ‖²

where M is the set of masked (target) positions, gφ is the predictor, fθ is the context encoder, f̄θ is the EMA target encoder, and sg(·) is the stop-gradient operator.
Multi-block masking: sample M = 4 target blocks, each covering 15–20% of the image area, with aspect ratio between 0.75 and 1.5. The context is the set of all remaining visible patches. The predictor receives the context embeddings plus learnable mask tokens carrying positional encodings for the target positions.
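The sampling procedure above can be sketched in plain Python. This is a minimal illustration on a 14×14 patch grid (a 224-px image with 16-px patches); the official I-JEPA implementation additionally samples a large context block and removes its overlap with the targets, which is omitted here:

```python
import random

def sample_target_blocks(grid=14, num_blocks=4,
                         scale=(0.15, 0.20), aspect=(0.75, 1.5), seed=0):
    """Sample rectangular target blocks on a grid x grid patch grid.

    Each block's area is a fraction of the image area drawn from `scale`,
    with aspect ratio (h/w) drawn from `aspect`. Returns the set of
    (row, col) patch indices covered by any target block, plus the
    context set (its complement).
    """
    rng = random.Random(seed)
    targets = set()
    for _ in range(num_blocks):
        area = rng.uniform(*scale) * grid * grid
        ratio = rng.uniform(*aspect)                      # ratio = h / w
        h = max(1, min(grid, round((area * ratio) ** 0.5)))
        w = max(1, min(grid, round((area / ratio) ** 0.5)))
        top = rng.randrange(grid - h + 1)
        left = rng.randrange(grid - w + 1)
        targets |= {(r, c) for r in range(top, top + h)
                           for c in range(left, left + w)}
    context = {(r, c) for r in range(grid) for c in range(grid)} - targets
    return targets, context

targets, context = sample_target_blocks()
print(len(targets) + len(context))  # 196: targets and context partition the grid
```

Because the blocks may overlap each other, the union of targets can cover less than 4 × 15–20% of the patches; the context is always exactly the complement.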
| Variant | Modality | Masking | Predictor | Loss | Key Innovation | Code |
|---|---|---|---|---|---|---|
| JEPA | Theory | -- | -- | -- | Latent-space prediction concept | -- |
| I-JEPA | Image | Multi-block spatial | Narrow transformer | L2 | First implementation; multi-block masking | GitHub |
| H-JEPA | Image | Multi-block | 4-layer transformer | VICReg+pred | FPN hierarchy; multi-scale | GitHub |
| MC-JEPA | Video | Factored (time+space) | 2 predictors | L2 dual | Disentangled motion vs content | -- |
| V-JEPA | Video | Spatiotemporal tubes | Transformer | L2 | No pixel reconstruction for video | GitHub |
| Point-JEPA | 3D Points | Proximity 3D blocks | Transformer | L2/SmoothL1 | Sequencer for 3D ordering | -- |
| 3D-JEPA | 3D Points | Multi-block 3D | Context-aware | L2 | Context-aware decoder | -- |
| ACT-JEPA | Robot | -- | Joint | Action+Latent | Joint action + observation | -- |
| V-JEPA 2 | Video+Robot | Multi-scale temporal | 12-block deep | L2+VICReg | 1M hours; zero-shot robot | -- |
| Audio-JEPA | Audio | Patch on mel-spec | Transformer | L2 | ViT on mel-spectrograms | -- |
| LeJEPA | Image | Adaptive curriculum | Wide, shallow | L2+VICReg | Fix I-JEPA fragilities | -- |
| Causal-JEPA | Video/Sim | Object-level | Object | Latent | Object masking = interventions | Yes |
| V-JEPA 2.1 | Video+Robot | Dense (vis+masked) | Deep | Dense pred | Dense features; deep self-sup | -- |
| ThinkJEPA | Video+VLM | Dual-temporal | JEPA+VLM | Hybrid | VLM reasoning + JEPA | -- |
| LeWorldModel | Pixels | Temporal | Light | Pred+Gaussian | 2 losses, 1 HP, no EMA | -- |
LeCun proposed JEPA as the world model at the center of a modular cognitive architecture for autonomous machine intelligence, with five modules: perception (extracts representations), a JEPA world model (predicts future states in latent space), cost (evaluates desirability), actor (proposes actions), and short-term memory.
Predicting all details of the future (pixel-level) is both intractable and unnecessary. A self-driving car doesn't need to predict every leaf — it needs to predict "the car ahead will brake." JEPA achieves this by encoding observations into compact representations and predicting future representations conditioned on actions.
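To make the module decomposition concrete, here is a toy sketch of how an actor could score candidate actions by rolling the world model forward in latent space and evaluating the cost module. All functions are invented stand-ins for illustration; none of them come from the paper:

```python
def encode(obs):
    """Perception: compress a raw observation (here, a list of floats)
    into a low-dimensional latent -- a toy stand-in for an encoder."""
    return (sum(obs) / len(obs), max(obs) - min(obs))   # toy 2-d latent

def predict(latent, action):
    """World model: predict the next latent conditioned on an action.
    A real JEPA predictor is a transformer; this is a toy linear update."""
    mean, spread = latent
    return (mean + 0.1 * action, spread * 0.95)

def cost(latent, goal_mean=1.0):
    """Cost module: scalar desirability of a predicted latent state."""
    return (latent[0] - goal_mean) ** 2

# Actor: pick the action whose predicted latent minimizes the cost.
z = encode([0.2, 0.4, 0.1, 0.3])
best = min((-1.0, 0.0, 1.0), key=lambda a: cost(predict(z, a)))
print("best action:", best)   # the action that moves the latent toward the goal
```

The point of the sketch is the control flow: planning happens entirely over latents, and the raw observation is only touched once, by the perception module.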
The first concrete implementation of JEPA. Demonstrates that predicting in latent space produces representations superior to MAE for downstream tasks, especially in low-shot and transfer settings.
MAE reconstructs pixels, forcing the encoder to retain low-level texture/color. I-JEPA predicts abstract representations, so the encoder focuses on high-level structure. This is why I-JEPA excels at object counting and depth estimation (structure, not texture).
```python
# I-JEPA training pseudocode
x = sample_image()
ctx_patches, tgt_positions = multi_block_mask(x)

z_ctx = context_encoder(ctx_patches)       # ViT on visible patches only
with no_grad():
    z_tgt = target_encoder(x)              # full image through EMA encoder

z_pred = predictor(z_ctx, mask_tokens)     # predict at target positions
loss = mse_loss(z_pred, z_tgt[tgt_positions])
loss.backward(); optimizer.step()

# EMA update of the target encoder (tau close to 1, annealed toward 1.0)
target_encoder.params = tau * target_encoder.params + (1 - tau) * context_encoder.params
```
Extends I-JEPA with multi-scale hierarchical representation learning via a Feature Pyramid Network (FPN). Learns representations at 3 hierarchy levels simultaneously.
```
# H-JEPA repository structure
src/
  models/   # encoder, predictor, H-JEPA module
  losses/   # VICReg, SigReg, combined
  masks/    # masking strategies
  data/     # datasets and transforms
```

```yaml
# Config
model: { encoder: vit_tiny, embed_dim: 192, num_hierarchies: 3 }
loss: { type: combined }   # vicreg + prediction
```
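Since H-JEPA's combined loss leans on VICReg, here is a dependency-free sketch of the three VICReg terms (invariance, variance, covariance) with the weights from the original VICReg paper. A real implementation operates on framework tensors; this version takes batches as lists of equal-length lists:

```python
def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg over two batches of paired embeddings.

    invariance: mean squared distance between paired embeddings;
    variance:   hinge pushing each dimension's std above 1 (anti-collapse);
    covariance: squared off-diagonal covariance, decorrelating dimensions.
    """
    n, d = len(za), len(za[0])
    # invariance term
    sim = sum((a - b) ** 2
              for va, vb in zip(za, zb) for a, b in zip(va, vb)) / (n * d)

    def var_cov(z):
        means = [sum(v[j] for v in z) / n for j in range(d)]
        centered = [[v[j] - means[j] for j in range(d)] for v in z]
        stds = [(sum(c[j] ** 2 for c in centered) / (n - 1) + eps) ** 0.5
                for j in range(d)]
        var = sum(max(0.0, 1.0 - s) for s in stds) / d
        cov = 0.0
        for j in range(d):
            for k in range(d):
                if j != k:
                    cjk = sum(c[j] * c[k] for c in centered) / (n - 1)
                    cov += cjk ** 2
        return var, cov / d

    va, ca = var_cov(za)
    vb, cb = var_cov(zb)
    return sim_w * sim + var_w * (va + vb) + cov_w * (ca + cb)

z = [[1.5, 0.0], [-1.5, 0.0], [0.0, 1.5], [0.0, -1.5]]
print(vicreg_loss(z, z))  # 0.0: aligned, high-variance, decorrelated
```

The variance hinge is what lets H-JEPA (and LeJEPA-style variants) avoid collapse without negatives: a collapsed batch has zero per-dimension std and is penalized heavily.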
Extends JEPA to video with disentangled representations for motion and content using factored masking and two separate predictors.
Each predictor is a ~6-block transformer (384-dim). Both losses backpropagate through the shared context encoder.
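A toy sketch of the shared-encoder/two-head arrangement follows. Function names and the unweighted sum are illustrative assumptions, not taken from the paper; the key point is that both losses are summed, so gradients from both objectives flow into one shared context encoder:

```python
def mse(pred, tgt):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)

def mc_jepa_loss(z_ctx, motion_head, content_head,
                 z_motion_tgt, z_content_tgt):
    """Two predictors read the same shared context embedding; summing
    their losses lets both objectives shape one encoder (hypothetical
    sketch -- names and weighting are illustrative)."""
    return (mse(motion_head(z_ctx), z_motion_tgt) +
            mse(content_head(z_ctx), z_content_tgt))

# Toy heads: identity for content, a fixed shift for "motion".
z = [0.5, -0.5]
loss = mc_jepa_loss(z,
                    motion_head=lambda v: [x + 1.0 for x in v],
                    content_head=lambda v: list(v),
                    z_motion_tgt=[1.5, 0.5],
                    z_content_tgt=[0.5, -0.5])
print(loss)  # 0.0 -- both heads hit their targets exactly
```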
JEPA for self-supervised video understanding. Predicts latent representations of spatiotemporal masked regions without pixel-level reconstruction.
```python
# V-JEPA spatiotemporal masking
video = load_video(T=16, H=224, W=224)
tokens = patchify_3d(video, patch_size=(2, 16, 16))   # tubelet embedding

target_tubes = sample_tube_masks(
    num_targets=4, spatial_scale=(0.15, 0.2),
    temporal_span=(0.5, 1.0), aspect_ratio=(0.75, 1.5),
)

z_ctx = context_encoder(tokens[~target_tubes])   # visible tokens only
z_pred = predictor(z_ctx, mask_tokens_3d)
z_tgt = target_encoder(tokens)                   # EMA encoder, full clip
loss = mse(z_pred, stop_grad(z_tgt[target_tubes]))
```
Adapts JEPA to audio by treating mel-spectrograms as 2D images. Uses ViT backbone with random patch masking on spectrograms.
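Treating a mel-spectrogram as an image reduces to standard ViT patchification. A minimal sketch follows; the 16×16 patch size and the 128-mel × 512-frame spectrogram shape are illustrative assumptions, not the paper's exact settings:

```python
def patchify(spec, ph=16, pw=16):
    """Split a mel-spectrogram (2-d list: mel_bins x frames) into
    non-overlapping ph x pw patches, row-major, each flattened into a
    vector -- the same treatment a ViT applies to an image."""
    H, W = len(spec), len(spec[0])
    patches = []
    for r in range(0, H - ph + 1, ph):
        for c in range(0, W - pw + 1, pw):
            patches.append([spec[r + i][c + j]
                            for i in range(ph) for j in range(pw)])
    return patches

# 128 mel bins x 512 frames -> (128/16) * (512/16) = 256 patches
spec = [[0.0] * 512 for _ in range(128)]
patches = patchify(spec)
print(len(patches), len(patches[0]))  # 256 256
```

From here the recipe is the generic one: mask a random subset of these patches, encode the visible rest, and predict the masked patches' latent embeddings.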
JEPA for 3D point clouds. Introduces a sequencer module that orders patch embeddings by proximity for efficient context/target selection in 3D space.
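A greedy nearest-neighbor pass illustrates the sequencer idea: unlike image patches, 3D patches have no natural raster order, so one is imposed by spatial proximity. This is a hypothetical sketch of the concept; Point-JEPA's actual sequencer may differ in detail:

```python
def sequence_by_proximity(centers):
    """Order 3-d patch centers greedily by nearest neighbor so that
    adjacent indices in the sequence are spatially close, making
    contiguous index ranges usable as context/target blocks."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    remaining = list(range(len(centers)))
    order = [remaining.pop(0)]              # start from the first patch
    while remaining:
        last = centers[order[-1]]
        nxt = min(remaining, key=lambda i: d2(centers[i], last))
        remaining.remove(nxt)
        order.append(nxt)
    return order

centers = [(0, 0, 0), (5, 5, 5), (0, 1, 0), (5, 6, 5)]
print(sequence_by_proximity(centers))  # [0, 2, 1, 3]
```

With such an ordering, sampling a contiguous run of sequence indices yields a spatially coherent 3D block, mirroring what rectangular multi-block masking gives for free on images.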
A distinct approach to 3D JEPA with emphasis on context-aware decoding and geometry-aware masking.
Bridges imitation learning and self-supervised learning by jointly predicting action sequences and latent observation sequences end-to-end.
The major scale-up of JEPA: trained on >1 million hours of internet video. First JEPA model to demonstrate zero-shot robotic manipulation.
| Aspect | V-JEPA 1 | V-JEPA 2 |
|---|---|---|
| Scale | ViT-H (632M) | ViT-g (~1B+) |
| Data | ~2M clips | >1M hours video+images |
| Stages | Video only | Image → Video |
| Masking | Short-range | Multi-scale (short+long) |
| Predictor | ~6 blocks | 12 blocks, wider |
| Robot | None | Zero-shot Franka, <62h data |
Systematically diagnoses and fixes training fragilities in I-JEPA. Key finding: I-JEPA is more fragile than it appears.
Introduces object-level masking that acts as latent interventions with counterfactual-like properties. Moving from patches to semantically meaningful objects.
Unlocks dense features in JEPA: spatially structured, semantically coherent, temporally consistent.
| Task | V-JEPA 2 | V-JEPA 2.1 |
|---|---|---|
| Grasping | baseline | +20 points |
| Ego4D anticipation (mAP) | -- | 7.71 |
| Epic-Kitchens (R@5) | 39.7 | 40.8 |
| SSv2 | 77.3 | 77.7 |
| NYUv2 depth (RMSE, lower is better) | -- | 0.307 |
Combines JEPA with Vision-Language Model reasoning through a dual-temporal pathway.
A minimalist JEPA world model: two loss terms, one hyperparameter, no EMA, trainable on a single GPU in hours.
[1] LeCun (2022). A Path Towards Autonomous Machine Intelligence.
[2] Assran et al. (2023). I-JEPA. arXiv:2301.08243
[3] Bardes, Ponce, LeCun (2023). MC-JEPA. arXiv:2307.12698
[4] Bardes et al. (2024). V-JEPA.
[5] Saito et al. (2024). Point-JEPA. arXiv:2404.16432
[6] Hu et al. (2024). 3D-JEPA. arXiv:2409.15803
[7] Wiggins (2024). H-JEPA.
[8] Vujinovic, Kovacevic (2025). ACT-JEPA. arXiv:2501.14622
[9] Assran et al. (2025). V-JEPA 2. arXiv:2506.09985
[10] Tuncay et al. (2025). Audio-JEPA. arXiv:2507.02915
[11] LeJEPA (2025). arXiv:2511.08544
[12] Nam et al. (2026). Causal-JEPA. arXiv:2602.11389
[13] Mur-Labadia et al. (2026). V-JEPA 2.1. arXiv:2603.14482
[14] Zhang et al. (2026). ThinkJEPA. arXiv:2603.22281
[15] Maes et al. (2026). LeWorldModel. arXiv:2603.19312