JEPA SOLUTIONS

Joint Embedding Predictive Architectures — from LeCun's vision to 15 implementations — complete technical survey

What is JEPA?

Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework proposed by Yann LeCun in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." JEPA represents a fundamental departure from both generative models (which predict pixels/tokens) and contrastive learning (which requires negative pairs). Instead, JEPA makes predictions in abstract representation space — it learns to predict the latent embedding of a target signal from the embedding of a context signal, without ever reconstructing the raw input.

The core insight is elegant: by predicting what you should know about the target rather than every pixel detail, the model naturally learns to discard unpredictable noise (textures, lighting, exact pixel values) and retain semantic, structural information. This is precisely what biological perception does — you recognize a chair by its abstract structure, not by memorizing every photon reflected from its surface.

Generative (MAE, GPT)
Predicts raw inputs (pixels, tokens). Wastes capacity on low-level details. Works well for text (discrete tokens are semantic), poorly for continuous signals (images, video).
Contrastive (SimCLR, CLIP)
Requires negative pairs to avoid collapse. Learns invariances via augmentations. Needs careful engineering of augmentation strategy per domain.
JEPA (Predictive)
Predicts in latent space. No negative pairs, no pixel reconstruction. The predictor is an information bottleneck that naturally prevents collapse. Domain-agnostic.

The Three Pillars of JEPA

Every JEPA variant shares three core components, regardless of the input modality:

Context Encoder (fθ)
A transformer (typically ViT) that encodes visible/unmasked portions of the input into latent representations. Trained via gradient descent.
Target Encoder (f̄θ)
Same architecture, parameters updated via EMA: θ̄ ← τθ̄ + (1-τ)θ. Processes the full, unmasked input. No gradient.
Predictor (gφ)
A lightweight transformer that takes context representations + mask tokens and predicts target representations. Creates the information bottleneck.
Core JEPA Architecture
[Architecture diagram] The input signal is split into a context region and target regions. The context encoder f_θ (a ViT, trained by gradient descent) embeds the visible context into z_ctx. The predictor g_φ (lightweight, fed learnable mask tokens) maps z_ctx to predictions z_pred at the target positions. The target encoder f̄_θ (an EMA copy, θ̄ ← τθ̄ + (1-τ)θ, no gradient) embeds the full input into z_target, to which a stop-gradient is applied.

Training loss: L = ||z_pred - sg(z_target)||², with the prediction made in latent space.

Key insight: predictions happen in representation space, not pixel space. The model learns to predict "what matters" about the target, discarding low-level noise. No negative pairs, no pixel reconstruction, no handcrafted augmentations.

Collapse prevention: (1) EMA target encoder, (2) predictor bottleneck, (3) multi-block masking, (4) no trivial shortcut.
Self-Supervised Learning Paradigms Compared
Aspect             | Generative (MAE)      | Contrastive (SimCLR, DINO) | JEPA (Predictive)
Prediction space   | Input (pixels/tokens) | None (alignment)           | Latent representations
Negative samples   | No                    | Yes (or momentum)          | No
Augmentations      | Masking               | Heavy (crop, color, etc.)  | Masking only
Collapse avoidance | N/A (reconstructive)  | Negatives / momentum       | Predictor bottleneck + EMA
Capacity waste     | High (pixel details)  | Low                        | Low (semantic only)
Low-shot transfer  | Moderate              | Good                       | Excellent
Domain flexibility | Good                  | Needs augmentation design  | Excellent (masking is universal)
Mathematical Formulation

Loss Function

L = (1/|M|) ∑_{i∈M} || g_φ(f_θ(x_ctx), m_i) − sg(f̄_θ(x)_i) ||²

Where M is the set of masked target positions, g_φ is the predictor, f_θ is the context encoder, f̄_θ is the EMA target encoder, and sg(·) is the stop-gradient operator.
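As a concrete, framework-free sketch, the masked latent loss can be computed as follows (shapes and names are illustrative, not taken from any reference implementation):

```python
import numpy as np

def jepa_loss(z_pred, z_target, mask_idx):
    """Mean squared L2 distance between predicted embeddings and
    stop-gradient target embeddings at the masked positions.
    z_pred: (|M|, D) predictions; z_target: (N, D) full-input targets."""
    diff = z_pred - z_target[mask_idx]   # z_target is treated as a constant (stop-grad)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Note that only z_pred carries gradients; in a real training loop z_target would come from the EMA encoder under no-grad.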

EMA Update Rule

θ̄ ← τ · θ̄ + (1 - τ) · θ     (τ: cosine schedule 0.996 → 1.0)
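A minimal sketch of the EMA update with a cosine momentum schedule rising from 0.996 to 1.0 (the schedule shape is a common choice; individual papers may differ in details):

```python
import numpy as np

def cosine_momentum(step, total_steps, tau_base=0.996, tau_final=1.0):
    """Cosine schedule for the EMA momentum: tau_base at step 0, tau_final at the end."""
    progress = step / max(total_steps - 1, 1)
    return tau_final - (tau_final - tau_base) * (np.cos(np.pi * progress) + 1) / 2

def ema_update(target_params, online_params, tau):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, applied per tensor."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```

As tau approaches 1.0 late in training, the target encoder effectively freezes, stabilizing the prediction targets.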

Multi-Block Masking (I-JEPA)

Sample M=4 target blocks with scale 0.15-0.2 of image area and aspect ratio 0.75-1.5. Context = all remaining visible patches. Predictor receives context embeddings + learnable mask tokens with positional encoding at target positions.
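In code, the sampling procedure above might look like the following numpy sketch (I-JEPA's reference implementation differs in details such as overlap handling and per-batch sharing):

```python
import numpy as np

def sample_block(grid_h, grid_w, scale=(0.15, 0.2), aspect=(0.75, 1.5), rng=None):
    """Sample one rectangular target block on the patch grid."""
    if rng is None:
        rng = np.random.default_rng()
    area = rng.uniform(*scale) * grid_h * grid_w
    ar = rng.uniform(*aspect)                              # aspect ratio h/w
    h = min(grid_h, max(1, int(np.rint(np.sqrt(area * ar)))))
    w = min(grid_w, max(1, int(np.rint(np.sqrt(area / ar)))))
    top = rng.integers(0, grid_h - h + 1)
    left = rng.integers(0, grid_w - w + 1)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[top:top + h, left:left + w] = True
    return mask

def multi_block_mask(grid_h=14, grid_w=14, num_targets=4, seed=0):
    """M target blocks plus the complementary context region."""
    rng = np.random.default_rng(seed)
    targets = [sample_block(grid_h, grid_w, rng=rng) for _ in range(num_targets)]
    context = ~np.logical_or.reduce(targets)               # context = all non-target patches
    return targets, context
```

On a 14x14 ViT patch grid, each target block covers roughly 30-40 patches, and the context is everything outside the union of targets.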

JEPA Timeline
2022     JEPA          LeCun's concept paper
Jan 23   I-JEPA        Images
Jul 23   MC-JEPA       Motion + content
Feb 24   V-JEPA        Video
Apr 24   Point-JEPA    3D point clouds
Sep 24   3D-JEPA       3D representations
2024     H-JEPA        Hierarchical
Jan 25   ACT-JEPA      Policy + world model
Jun 25   V-JEPA 2      Video + planning
Jul 25   Audio-JEPA    Audio
Nov 25   LeJEPA        Effective training
Feb 26   Causal-JEPA   Causal world model
Mar 26   V-JEPA 2.1    Dense features
Mar 26   ThinkJEPA     JEPA + VLM
Mar 26   LeWorldModel  Minimal world model
All JEPA Variants at a Glance
JEPA (Concept) [Theory]
  • Year: 2022
  • Author: Yann LeCun
  • Key idea: Predict in representation space

I-JEPA [Image, Meta FAIR]
  • Backbone: ViT-H/14 (632M)
  • Training: 16x A100, <72h, ImageNet
  • Key: Multi-block masking

MC-JEPA [Video, Meta FAIR]
  • Key idea: Disentangle motion vs content
  • Predictors: 2 (motion + content)
  • Masking: Factored temporal + spatial

V-JEPA [Video, Meta FAIR]
  • Backbone: ViT-H/16
  • Data: VideoMix2M (~2M clips)
  • Masking: Spatiotemporal tubes

Point-JEPA [3D Points]
  • Key idea: Sequencer for patch ordering
  • Input: FPS + kNN point patches
  • Benchmarks: ModelNet40, ScanObjectNN

3D-JEPA [3D Points]
  • Key idea: Context-aware decoder
  • Result: 88.65% PB_T50_RS @ 150 epochs
  • Masking: Geometry-aware 3D blocks

H-JEPA [Image, Hierarchy]
  • Key idea: FPN multi-scale hierarchy
  • Params: ~13.8M (ViT-Tiny)
  • Loss: VICReg + prediction

ACT-JEPA [Robotics, World Model]
  • Key idea: Joint action + observation prediction
  • Result: +40% world model, +10% success
  • Innovation: Unify IL and SSL

V-JEPA 2 [Video, Planning, Meta FAIR]
  • Data: >1M hours video + images
  • Key: SSv2 77.3%, zero-shot robot
  • Robot: <62h unlabeled Droid video

Audio-JEPA [Audio]
  • Input: Mel-spectrograms @ 32kHz
  • Backbone: ViT for audio
  • Result: on par with wav2vec 2.0, 5x less data

LeJEPA [Image, Training Fix]
  • Key idea: Fix JEPA fragilities
  • Innovation: VICReg + adaptive masking
  • Predictor: Wider, shallower

Causal-JEPA [Causal, World Model]
  • Key idea: Object-level latent interventions
  • Result: +20% counterfactual reasoning
  • Efficiency: 1% of latent features for control

V-JEPA 2.1 [Video, Robotics, Meta FAIR]
  • Key idea: Dense features + deep self-supervision
  • Grasping: +20pt over V-JEPA 2 AC
  • Depth: 0.307 RMSE NYUv2 (linear)

ThinkJEPA [World Model, Planning]
  • Key idea: Dual pathway: JEPA + VLM reasoning
  • Branches: Dense JEPA + VLM thinker
  • Result: Outperforms both baselines

LeWorldModel [World Model, Control]
  • Params: ~15M trainable
  • Speed: 48x faster than foundation-model planners
  • Loss: 2 terms, 1 hyperparameter
Complete Comparison Matrix
Variant      | Modality    | Masking               | Predictor           | Loss          | Key Innovation                            | Code
JEPA         | Theory      | --                    | --                  | --            | Latent-space prediction concept           | --
I-JEPA       | Image       | Multi-block spatial   | Narrow transformer  | L2            | First implementation; multi-block masking | GitHub
H-JEPA       | Image       | Multi-block           | 4-layer transformer | VICReg+pred   | FPN hierarchy; multi-scale                | GitHub
MC-JEPA      | Video       | Factored (time+space) | 2 predictors        | L2 dual       | Disentangled motion vs content            | --
V-JEPA       | Video       | Spatiotemporal tubes  | Transformer         | L2            | No pixel reconstruction for video         | GitHub
Point-JEPA   | 3D Points   | Proximity 3D blocks   | Transformer         | L2/SmoothL1   | Sequencer for 3D ordering                 | --
3D-JEPA      | 3D Points   | Multi-block 3D        | Context-aware       | L2            | Context-aware decoder                     | --
ACT-JEPA     | Robot       | --                    | Joint               | Action+Latent | Joint action + observation                | --
V-JEPA 2     | Video+Robot | Multi-scale temporal  | 12-block deep       | L2+VICReg     | 1M hours; zero-shot robot                 | --
Audio-JEPA   | Audio       | Patch on mel-spec     | Transformer         | L2            | ViT on mel-spectrograms                   | --
LeJEPA       | Image       | Adaptive curriculum   | Wide, shallow       | L2+VICReg     | Fix I-JEPA fragilities                    | --
Causal-JEPA  | Video/Sim   | Object-level          | Object              | Latent        | Object masking = interventions            | Yes
V-JEPA 2.1   | Video+Robot | Dense (vis+masked)    | Deep                | Dense pred    | Dense features; deep self-sup             | --
ThinkJEPA    | Video+VLM   | Dual-temporal         | JEPA+VLM            | Hybrid        | VLM reasoning + JEPA                      | --
LeWorldModel | Pixels      | Temporal              | Light               | Pred+Gaussian | 2 losses, 1 HP, no EMA                    | --
Detailed Analysis of Each JEPA Variant
JEPA -- The Foundational Concept [Theory]
Yann LeCun, Meta AI / NYU, 2022  |  Position Paper

LeCun proposed JEPA as the world model at the center of a modular cognitive architecture for autonomous machine intelligence. The five modules: perception (extract representations), world model (JEPA) (predict future states in latent space), cost (evaluate desirability), actor (propose actions), and short-term memory.

Why Latent Prediction?

Predicting all details of the future (pixel-level) is both intractable and unnecessary. A self-driving car doesn't need to predict every leaf — it needs to predict "the car ahead will brake." JEPA achieves this by encoding observations into compact representations and predicting future representations conditioned on actions.

"The key idea is to train a world model that can predict abstract representations of future observations, rather than the observations themselves." — Y. LeCun, 2022
I-JEPA -- Image JEPA [Image, Meta FAIR]
Assran, Duval, Misra, Bojanowski, Vincent, Rabbat, LeCun, Ballas, Jan 2023
Paper: arXiv:2301.08243  |  Code: facebookresearch/ijepa

The first concrete implementation of JEPA. Demonstrates that predicting in latent space produces representations superior to MAE for downstream tasks, especially in low-shot and transfer settings.

Architecture

  • Context Encoder: ViT-B/16, ViT-L/16, or ViT-H/14 (632M params)
  • Target Encoder: EMA copy, momentum cosine 0.996→1.0
  • Predictor: Narrow transformer (~6 layers, smaller hidden dim)

Multi-Block Masking (Key Innovation)

  • 4 target blocks: scale 0.15-0.2, aspect ratio 0.75-1.5
  • 1 context block: ~0.85 of image, with target regions removed
  • Predictor must predict from spatially distant context patches

Why I-JEPA Outperforms MAE on Transfer

MAE reconstructs pixels, forcing the encoder to retain low-level texture/color. I-JEPA predicts abstract representations, so the encoder focuses on high-level structure. This is why I-JEPA excels at object counting and depth estimation (structure, not texture).

# I-JEPA training pseudocode
x = sample_image()
ctx_patches, tgt_patches = multi_block_mask(x)

z_ctx = context_encoder(ctx_patches)          # ViT on visible patches
with no_grad():
    z_tgt = target_encoder(x)                 # Full image through EMA

z_pred = predictor(z_ctx, mask_tokens)         # Predict at target positions
loss = mse_loss(z_pred, z_tgt[tgt_positions])

loss.backward(); optimizer.step()
target_encoder.params = tau * target_encoder.params + (1-tau) * context_encoder.params
H-JEPA -- Hierarchical JEPA [Image, Hierarchy]
Jon Wiggins, 2024  |  Code: jonwiggins/H-JEPA (MIT)

Extends I-JEPA with multi-scale hierarchical representation learning via a Feature Pyramid Network (FPN). Learns representations at 3 hierarchy levels simultaneously.

Architecture

  • Encoder: ViT-Tiny (5.5M) + RoPE + Flash Attention
  • Predictor: 4-layer transformer (2.8M params)
  • FPN: 3 levels, 128-channel fusion
  • Total: ~13.8M params (8.3M trainable)
  • Loss: Combined VICReg + prediction + SigReg (sigmoid regularization)
# H-JEPA structure
src/
  models/       # encoder, predictor, H-JEPA module
  losses/       # VICReg, SigReg, combined
  masks/        # masking strategies
  data/         # datasets and transforms

# Config
model: { encoder: vit_tiny, embed_dim: 192, num_hierarchies: 3 }
loss: { type: combined }  # vicreg + prediction
MC-JEPA -- Motion-Content JEPA [Video, Meta FAIR]
Bardes, Ponce, LeCun, Meta FAIR / NYU / Inria, Jul 2023  |  arXiv:2307.12698

Extends JEPA to video with disentangled representations for motion and content using factored masking and two separate predictors.

Factored Masking (Key Innovation)

  • Motion masking (temporal): Entire frames masked → predictor must learn temporal dynamics
  • Content masking (spatial): Same spatial region masked across ALL frames → predictor must learn appearance
L_total = L_motion + L_content = ||g_M(f_θ(x_vis)) − sg(f̄_θ(x))_M||² + ||g_C(f_θ(x_vis)) − sg(f̄_θ(x))_C||², where the motion loss is evaluated at temporally-masked positions and the content loss at spatially-masked positions.

Each predictor is a ~6-block transformer (384-dim). Both losses backpropagate through the shared context encoder.
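The two masking schemes can be made concrete with a small sketch over a (T, H, W) patch grid (function names are illustrative, not from the paper):

```python
import numpy as np

def motion_mask(T, H, W, frames):
    """Temporal masking: hide whole frames, so the predictor
    must model temporal dynamics to fill them in."""
    m = np.zeros((T, H, W), dtype=bool)
    m[frames] = True
    return m

def content_mask(T, H, W, top, left, h, w):
    """Spatial masking: hide the same region in every frame,
    so the predictor must model appearance."""
    m = np.zeros((T, H, W), dtype=bool)
    m[:, top:top + h, left:left + w] = True
    return m
```

The motion predictor g_M sees targets from the first scheme, the content predictor g_C from the second; both losses flow back through the shared encoder.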

Results

  • Something-Something v2: Motion features alone approach full-model performance (validates disentanglement)
  • Optical flow estimation competitive with unsupervised methods
  • Unifies self-supervised visual learning and optical flow in a single encoder
V-JEPA -- Video JEPA [Video, Meta FAIR]
Bardes, Garrido, Assran et al., Meta FAIR, Feb 2024  |  Code: facebookresearch/jepa

JEPA for self-supervised video understanding. Predicts latent representations of spatiotemporal masked regions without pixel-level reconstruction.

Architecture

  • Encoder: ViT-L/16, ViT-H/16; spatiotemporal tube tokens (2×16×16)
  • Masking: Contiguous spatiotemporal blocks, ~75-90% masking ratio
  • Data: VideoMix2M (~2M video clips)

Key Results (Frozen Encoder)

  • Kinetics-400: ~82% top-1 (frozen + attentive probe)
  • Something-Something v2: ~71.4% top-1
# V-JEPA spatiotemporal masking
video = load_video(T=16, H=224, W=224)
tokens = patchify_3d(video, patch_size=(2,16,16))

target_tubes = sample_tube_masks(
    num_targets=4, spatial_scale=(0.15, 0.2),
    temporal_span=(0.5, 1.0), aspect_ratio=(0.75, 1.5)
)
z_ctx = context_encoder(tokens[~target_tubes])
z_pred = predictor(z_ctx, mask_tokens_3d)
z_tgt = target_encoder(tokens)                 # full clip through EMA encoder
loss = mse(z_pred, stop_grad(z_tgt[target_tubes]))
Audio-JEPA [Audio]
Tuncay, Labbé, Benetos, Pellegrini, IRIT-SAMoVA / QMUL, Jul 2025 (ICME 2025)  |  arXiv:2507.02915

Adapts JEPA to audio by treating mel-spectrograms as 2D images. Uses ViT backbone with random patch masking on spectrograms.

  • Input: 10-second AudioSet clips @32kHz → mel-spectrograms
  • Masking: Random patch masking on the 2D spectrogram
  • Result: Comparable to wav2vec 2.0 and data2vec with <1/5 the training data
  • Evaluated on: X-ARES suite (speech, music, environmental sound)
Audio-JEPA validates JEPA's modality-agnostic nature: the same framework that works for images and video also works for audio by simply treating spectrograms as 2D patches.
Point-JEPA [3D Points]
Saito, Kudeshia, Poovvancheri, Apr 2024  |  arXiv:2404.16432

JEPA for 3D point clouds. Introduces a sequencer module that orders patch embeddings by proximity for efficient context/target selection in 3D space.

Architecture

  • Tokenization: FPS (center selection) + kNN grouping + mini-PointNet embedding
  • Masking: Proximity-based 3D block masking via the sequencer
  • Advantage over Point-MAE: No reconstruction ambiguity (many point sets = same surface)
  • No handcrafted 3D augmentations needed
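A minimal numpy sketch of the FPS + kNN tokenization step (the mini-PointNet embedding and the sequencer are omitted; function names are illustrative):

```python
import numpy as np

def farthest_point_sample(points, k, seed=0):
    """Greedy FPS: pick k well-spread center points from an (N, 3) cloud."""
    rng = np.random.default_rng(seed)
    n = len(points)
    centers = [int(rng.integers(n))]
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        # Distance of every point to its nearest already-chosen center.
        dist = np.minimum(dist, np.linalg.norm(points - points[centers[-1]], axis=1))
        centers.append(int(np.argmax(dist)))   # farthest point becomes next center
    return np.asarray(centers)

def knn_group(points, centers, m):
    """Group the m nearest neighbors of each center into a local patch: (k, m, 3)."""
    d = np.linalg.norm(points[None, :, :] - points[centers][:, None, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :m]
    return points[idx]
```

Each (m, 3) patch would then be embedded by a small PointNet to produce one token per center, which the sequencer orders by proximity.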
3D-JEPA [3D Points]
Hu, Cheng, Xie, Li, Zhu, Sep 2024  |  arXiv:2409.15803

A distinct approach to 3D JEPA with emphasis on context-aware decoding and geometry-aware masking.

  • Context-aware decoder: Continuously feeds context during target prediction (vs standard one-shot)
  • Multi-block 3D sampling: One informative context block + multiple representative target blocks
  • Result: 88.65% on ScanObjectNN PB_T50_RS with only 150 pretraining epochs
ACT-JEPA [Robotics, World Model]
Vujinovic, Kovacevic, Jan 2025  |  arXiv:2501.14622

Bridges imitation learning and self-supervised learning by jointly predicting action sequences and latent observation sequences end-to-end.

  • Insight: "Predicting latent observation sequences effectively generalizes to predicting action sequences"
  • +40% improvement in world model understanding
  • +10% higher task success rate across all environments
  • Enables learning from unlabeled data while maintaining policy relevance
ACT-JEPA realizes LeCun's vision: using JEPA not just for understanding, but for planning and acting. Joint prediction of actions and world states creates a unified representation for perception and control.
V-JEPA 2 [Video, Planning, Meta FAIR]
Assran, Bardes, Garrido et al. (29 authors), Jun 2025  |  arXiv:2506.09985

The major scale-up of JEPA: trained on >1 million hours of internet video. First JEPA model to demonstrate zero-shot robotic manipulation.

Aspect    | V-JEPA 1     | V-JEPA 2
Scale     | ViT-H (632M) | ViT-g (~1B+)
Data      | ~2M clips    | >1M hours video + images
Stages    | Video only   | Image, then video
Masking   | Short-range  | Multi-scale (short + long)
Predictor | ~6 blocks    | 12 blocks, wider
Robot     | None         | Zero-shot Franka, <62h data

Key Results

  • SSv2: 77.3%  |  Epic-Kitchens: 39.7 R@5
  • Video QA (8B): 84.0 PerceptionTest, 76.9 TempCompass
  • Zero-shot robot: Object picking/placing on Franka arms across two labs, no environment-specific training
LeJEPA -- Learning Effective JEPA [Image, Training Fix]
Nov 2025  |  arXiv:2511.08544

Systematically diagnoses and fixes training fragilities in I-JEPA. Key finding: I-JEPA is more fragile than it appears.

Key Fixes

  • Adaptive masking curriculum: Start low, increase over training (prevents early collapse)
  • Wider, shallower predictor: More stable than I-JEPA's narrow-deep design
  • VICReg regularization: Explicit variance + covariance terms
  • Modified EMA warmup: Prevents premature target convergence
L_total = L_pred + λ_var · L_var + λ_cov · L_cov
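The variance and covariance terms follow the standard VICReg formulation; this numpy sketch shows their shape (hyperparameter names and values are illustrative):

```python
import numpy as np

def vicreg_terms(z, gamma=1.0, eps=1e-4):
    """VICReg regularizers on a batch of embeddings z: (batch, dim).
    Variance term: keep each dimension's std above gamma (anti-collapse).
    Covariance term: decorrelate dimensions (anti-redundancy)."""
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = float(np.mean(np.maximum(0.0, gamma - std)))
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = float(np.sum(off_diag ** 2) / d)
    return var_loss, cov_loss
```

Collapsed embeddings (all rows identical) are penalized by the variance term even without negatives, which is exactly the fragility these fixes target.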
Causal-JEPA (C-JEPA) [Causal, World Model]
Nam, Le Lidec, Maes, LeCun, Balestriero, Feb 2026  |  arXiv:2602.11389

Introduces object-level masking that acts as latent interventions with counterfactual-like properties. Moving from patches to semantically meaningful objects.

  • Masking entire objects forces inference of object state from other objects → causal inductive bias
  • +20% absolute improvement in counterfactual reasoning (VQA)
  • Comparable control performance with only 1% of latent features
C-JEPA shows that HOW you mask matters as much as WHETHER you mask. Patch-level masking learns local patterns; object-level masking learns causal inter-object relationships.
V-JEPA 2.1 [Video, Robotics, Meta FAIR]
Mur-Labadia, Muckley, Bar, Assran et al., Mar 2026  |  arXiv:2603.14482

Unlocks dense features in JEPA: spatially structured, semantically coherent, temporally consistent.

Four Innovations

  • Dense predictive loss: Both visible AND masked tokens in training loss
  • Deep self-supervision: Hierarchical loss across intermediate encoder layers
  • Multi-modal tokenizers: Unified image + video training
  • Scaling: Improved capacity + data volume
Task                | V-JEPA 2 | V-JEPA 2.1
Grasping            | baseline | +20 points
Ego4D anticipation  | --       | 7.71 mAP
Epic-Kitchens (R@5) | 39.7     | 40.8
SSv2                | 77.3     | 77.7
Depth NYUv2 (RMSE)  | --       | 0.307
ThinkJEPA [World Model, Planning]
Zhang, Li, He, Nagarajan, Chen, Lu, Li, Fu, Mar 2026  |  arXiv:2603.22281

Combines JEPA with Vision-Language Model reasoning through a dual-temporal pathway.

  • Dense JEPA Branch: Consecutive frames, fine-grained motion
  • VLM Thinker Branch: Sparse frames, semantic reasoning
  • Hierarchical Pyramid: Aggregates multi-layer VLM outputs into JEPA-compatible features
  • Result: Outperforms both VLM-only and JEPA-only baselines on long-horizon manipulation
LeWorldModel [World Model, Control]
Maes, Le Lidec, Scieur, LeCun, Balestriero, Mar 2026  |  arXiv:2603.19312

A minimalist JEPA world model: two loss terms, one hyperparameter, no EMA, trainable on a single GPU in hours.

Radical Simplification

  • Loss: next-embedding prediction + Gaussian regularizer (replaces EMA for collapse prevention)
  • One hyperparameter (down from six in prior approaches)
  • ~15M parameters; single GPU, hours to train
  • 48x faster planning than foundation-model world models
  • Competitive on 2D and 3D control tasks
  • Latent space encodes physical structure; detects physically implausible events
L = L_prediction + λ · L_gaussian
LeWorldModel answers: how simple can a JEPA world model be? Two losses, one hyperparameter, no EMA, single GPU. This democratizes JEPA beyond large-scale labs.
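The paper's exact regularizer is not reproduced here; as a hedged sketch, a Gaussian regularizer can be read as pushing the batch of latents toward a standard normal, giving a two-term, one-hyperparameter loss with no EMA target encoder:

```python
import numpy as np

def gaussian_reg(z):
    """Hypothetical Gaussian regularizer: penalize batch mean far from 0
    and per-dimension variance far from 1 (keeps latents non-collapsed)."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return float(np.mean(mu ** 2) + np.mean((var - 1.0) ** 2))

def world_model_loss(z_pred, z_next, lam=0.1):
    """Two terms, one hyperparameter lam:
    next-embedding prediction + Gaussian regularizer."""
    pred = float(np.mean(np.sum((z_pred - z_next) ** 2, axis=-1)))
    return pred + lam * gaussian_reg(z_pred)
```

Because the regularizer alone rules out the collapsed solution (all latents equal), no EMA copy or stop-gradient is needed in this sketch.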
References

[1] LeCun (2022). A Path Towards Autonomous Machine Intelligence.

[2] Assran et al. (2023). I-JEPA. arXiv:2301.08243

[3] Bardes, Ponce, LeCun (2023). MC-JEPA. arXiv:2307.12698

[4] Bardes et al. (2024). V-JEPA. Code: facebookresearch/jepa

[5] Saito et al. (2024). Point-JEPA. arXiv:2404.16432

[6] Hu et al. (2024). 3D-JEPA. arXiv:2409.15803

[7] Wiggins (2024). H-JEPA. Code: jonwiggins/H-JEPA

[8] Vujinovic, Kovacevic (2025). ACT-JEPA. arXiv:2501.14622

[9] Assran et al. (2025). V-JEPA 2. arXiv:2506.09985

[10] Tuncay et al. (2025). Audio-JEPA. arXiv:2507.02915

[11] LeJEPA (2025). arXiv:2511.08544

[12] Nam et al. (2026). Causal-JEPA. arXiv:2602.11389

[13] Mur-Labadia et al. (2026). V-JEPA 2.1. arXiv:2603.14482

[14] Zhang et al. (2026). ThinkJEPA. arXiv:2603.22281

[15] Maes et al. (2026). LeWorldModel. arXiv:2603.19312