Authors: Bardes, Ponce, LeCun
Date: 2023-07
Category: Video
Derives from: I-JEPA
Score: 6.65/10 — Draft

Motion-Content JEPA (MC-JEPA)

1. Introduction

Self-supervised visual representation learning has historically been split into two largely independent lines of research. Content-focused methods — VICReg, DINO, BYOL, SimCLR — learn what objects are, building representations invariant to appearance transformations such as cropping, color jitter, and blur. Motion estimation methods — unsupervised optical flow — learn how objects move, predicting dense pixel-level correspondences between consecutive video frames. These two objectives capture fundamentally different aspects of visual understanding, yet they have been pursued in isolation: optical flow estimation does not require understanding what is moving, and content learning ignores how things move.

MC-JEPA (Motion-Content Joint-Embedding Predictive Architecture), introduced by Bardes, Ponce, and LeCun in July 2023, unifies both objectives within a single shared encoder. The key insight is that motion and content learning are not merely compatible — they are mutually beneficial. A content encoder that also reasons about motion learns features with richer spatial and temporal structure. Conversely, an optical flow estimator backed by a semantically-aware encoder produces more coherent flow fields, particularly in ambiguous regions where pure photometric matching fails.

MC-JEPA builds on two foundations. From the JEPA family, it inherits the principle of predicting in representation space rather than pixel space: the motion objective operates on pyramidal feature maps, not raw pixels, aligning it with the latent-prediction philosophy of I-JEPA. From the VICReg framework, it inherits a non-contrastive self-supervised objective that regularizes representations through variance, invariance, and covariance constraints. The novelty lies in their fusion: a single ConvNeXt-T backbone produces pyramidal features that simultaneously drive coarse-to-fine optical flow estimation and augmentation-invariant content learning.

The results validate the multi-task hypothesis. MC-JEPA achieves optical flow estimation on par with dedicated unsupervised methods (EPE of 2.81 on Sintel Clean, 2.67 on KITTI 2015) while simultaneously producing the best frozen-feature segmentation results among compared self-supervised approaches: 67.1 mIoU on Pascal VOC, 65.5 on Cityscapes, and 70.5 (J&F)m on DAVIS 2017 video object segmentation — surpassing even DINO.

In this article, we first describe the high-level method and intuition behind joint motion-content learning (Section 2), then present the complete dual-stream architecture with annotated diagrams (Section 3). We dissect each component — shared encoder, flow estimator, expander network, loss functions, and variance-covariance regularization — in Section 4. Sections 5 and 6 provide exhaustive implementation details and formal algorithms. We walk through the training procedure (Section 7) and inference protocol (Section 8) with full SVG diagrams. Section 9 presents benchmark results and ablation studies, Section 10 situates MC-JEPA within the JEPA family, and Section 11 summarizes key takeaways.

2. Method

The central question MC-JEPA asks is: can a single encoder simultaneously learn what is in a scene and how it moves?

Think of it this way: Imagine you are a film editor analyzing a movie clip. You need two kinds of understanding simultaneously. First, you need content understanding: recognizing that there is a person, a car, and a tree in the frame. Second, you need motion understanding: tracking that the person walks left while the car moves right. Traditionally, visual AI has trained separate specialists for each task — one that recognizes objects but is blind to motion, and another that tracks pixel movement but cannot tell a person from a tree. MC-JEPA trains a single "editor" who develops both skills at once, and the key discovery is that learning one skill actively improves the other.

More concretely, MC-JEPA operates through two complementary objectives applied to a shared convolutional encoder:

  1. Motion stream (M-JEPA): Given two consecutive video frames $I_t$ and $I_{t+1}$, the shared encoder produces pyramidal feature maps at six resolution levels. A coarse-to-fine flow estimator (based on PWC-Net) uses these features to predict dense optical flow fields. The training signal comes from multiple loss terms: a feature-level regression loss, a pixel-level reconstruction loss, edge-aware smoothness regularization, and cycle consistency between forward and backward flows. Crucially, the regression loss operates in feature space — the encoder must produce representations where spatial correspondence is meaningful — aligning this with the JEPA principle of latent prediction.
  2. Content stream (VICReg): Given two randomly augmented views of an ImageNet image, the shared encoder extracts features that are passed through an expander network. The VICReg loss enforces that representations of different augmented views are similar (invariance) while maintaining variance across the batch and decorrelating feature dimensions (covariance). This prevents the encoder from collapsing to trivial solutions.

Both objectives share the same ConvNeXt-T backbone. At each training iteration, a video batch drives the motion losses and an ImageNet batch drives the content loss; the combined gradient updates the shared encoder, the flow estimator, and the expander simultaneously. A variance-covariance regularization term is applied to the intermediate pyramidal features to stabilize multi-task training and prevent feature collapse at any resolution level.

Key distinction from I-JEPA: MC-JEPA does not use masking, an EMA target encoder, or a predictor network in the I-JEPA sense. There is no exponential moving average and no stop-gradient on a target branch. Instead, the motion "prediction" is the optical flow estimation itself — predicting how features should be warped between frames — and the content learning uses VICReg's non-contrastive regularization rather than a predictive target. MC-JEPA is a JEPA variant in that it predicts in representation space (feature-level flow regression), but its architectural design diverges significantly from the masking-based JEPA variants.

3. Model Overview

Architecture Diagram

Figure 1: MC-JEPA training architecture. The shared ConvNeXt-T encoder serves both the motion stream (top, orange arrows — optical flow estimation on video pairs) and the content stream (bottom, green arrows — VICReg on augmented ImageNet views). The total loss combines motion losses (weighted by $\alpha = 0.1$) with the VICReg self-supervised loss. Gradients from both streams flow through the shared encoder.

At-a-Glance

| Property | Value |
| --- | --- |
| Input type | Video frame pairs (motion) + ImageNet images (content) |
| Masking strategy | N/A — no masking. Motion uses optical flow prediction; content uses augmentation invariance |
| Encoder architecture | ConvNeXt-Tiny (~23M params), modified stem with 6 pyramid levels |
| Predictor type | PWC-Net–based coarse-to-fine flow estimator (~8M params) |
| Loss function | $\alpha(\mathcal{L}_{reg} + \mathcal{L}_{rec} + \mathcal{L}_{smooth} + \mathcal{L}_{cycle} + \mathcal{L}_{vc}) + \mathcal{L}_{ssl}$ |
| Key result | 67.1 mIoU frozen segm. (VOC), 70.5 (J&F)m (DAVIS), 2.81 EPE (Sintel Clean) |
| Parameters | ~23M (encoder) + ~8M (flow estimator) ≈ ~31M trainable; the expander is additional and discarded after training |

4. Main Components of MC-JEPA

4.1 Shared Encoder (ConvNeXt-T)

The backbone of MC-JEPA is a ConvNeXt-Tiny (Liu et al., 2022) — a pure convolutional architecture that modernizes ResNets with design elements borrowed from transformers: depthwise convolutions, GELU activations, layer normalization, and inverted bottleneck blocks. The choice of a convolutional backbone (rather than a Vision Transformer) is deliberate: convolutional networks naturally produce multi-scale feature pyramids, which are essential for coarse-to-fine optical flow estimation.

Architecture details: Standard ConvNeXt-T has four stages with channel dimensions [96, 192, 384, 768] and block counts [3, 3, 9, 3]. Each stage halves the spatial resolution, producing a 4-level pyramid. However, standard ConvNeXt-T uses a "patchify" stem — a single 4×4 convolution with stride 4 — that immediately downsamples by 4×, limiting the finest-level resolution.

Modified stem: MC-JEPA replaces this aggressive stem with two sequential convolutions:

  1. 4×4 convolution with stride 2, padding 1 → LayerNorm (halves resolution, 3→48 channels)
  2. 3×3 convolution with stride 2, padding 1 → LayerNorm (halves again, 48→48 channels)

This splits the 4× downsampling into two 2× steps, producing an additional feature level. The result is a 6-level pyramid (instead of the standard 5 levels) with the finest level at 1/2 resolution of the input — bringing flow predictions closer to pixel space. This modification is critical for optical flow accuracy.
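As a concrete sketch, the modified stem described above can be written in a few lines of PyTorch. This is our reconstruction from the paper's description, not official code; the channels-last `LayerNorm2d` wrapper follows ConvNeXt's convention and is an assumption.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dim of a B×C×H×W tensor (ConvNeXt-style)."""
    def forward(self, x):
        x = x.permute(0, 2, 3, 1)          # B×H×W×C
        x = super().forward(x)
        return x.permute(0, 3, 1, 2)       # back to B×C×H×W

def make_mcjepa_stem():
    """Two stride-2 convolutions replacing the standard 4×4 stride-4
    patchify stem, adding a pyramid level at 1/2 input resolution."""
    return nn.Sequential(
        nn.Conv2d(3, 48, kernel_size=4, stride=2, padding=1),   # 1/2 res, 3→48
        LayerNorm2d(48),
        nn.Conv2d(48, 48, kernel_size=3, stride=2, padding=1),  # 1/4 res, 48→48
        LayerNorm2d(48),
    )
```

A 224×224 input yields intermediate features at 112×112 and a stem output at 56×56: the first two of the six pyramid levels.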

Additional settings: drop_path_rate = 0.1, layer_scale_init_value = 0.0.

4.2 Flow Estimator (Modified PWC-Net)

The flow estimator predicts dense optical flow between frame $I_t$ and frame $I_{t+1}$ using a coarse-to-fine strategy adapted from PWC-Net (Sun et al., 2018). It is a separate module (~8M parameters) that takes the encoder's pyramidal features as input.

Coarse-to-fine procedure: At the coarsest level ($l = 1$), flow is initialized to zero. At each subsequent level $l + 1$, the estimator:

  1. Upsamples the flow from level $l$: $f_{t,t+1}^{(l)} \rightarrow f_{t,t+1}^{(l) \uparrow}$
  2. Warps source features using the current flow: $\hat{X}_{t+1}^{(l+1)} = \text{warp}(X_t^{(l+1)}, f_{t,t+1}^{(l) \uparrow})$
  3. Computes a 4D correlation volume: $V^{(l+1)} = \hat{X}_{t+1}^{(l+1)} \cdot {X_{t+1}^{(l+1)}}^T$, measuring the similarity between warped source features and actual target features
  4. Predicts residual flow through a small CNN $g_\phi$: $\Delta f^{(l+1)} = g_\phi(V^{(l+1)}, f_{t,t+1}^{(l) \uparrow})$
  5. Updates: $f_{t,t+1}^{(l+1)} = f_{t,t+1}^{(l) \uparrow} + \Delta f^{(l+1)}$

This process produces progressively refined flow estimates from coarse global motion to fine local displacement.
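Step 3 above, the correlation volume, compares each warped-source feature vector against target features within a bounded search window. A minimal sketch follows; the search range of 4 mirrors PWC-Net's default and is an assumption, as the paper does not state it.

```python
import torch
import torch.nn.functional as F

def compute_correlation_volume(x_warped, x_tgt, search_range=4):
    """Local correlation between warped-source and target features.

    For each displacement (dy, dx) in a (2r+1)×(2r+1) window, take the
    channel-wise dot product, normalized by the channel count.
    """
    B, C, H, W = x_warped.shape
    r = search_range
    x_pad = F.pad(x_tgt, (r, r, r, r))                     # zero-pad borders
    vols = []
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            shifted = x_pad[:, :, dy:dy + H, dx:dx + W]    # target shifted by (dy-r, dx-r)
            vols.append((x_warped * shifted).sum(dim=1) / C)
    return torch.stack(vols, dim=1)                        # B×(2r+1)²×H×W
```

The output's displacement channels feed the residual-flow CNN $g_\phi$ alongside the upsampled flow.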

Critical modifications from standard PWC-Net:

  • LayerNorm after each convolution (except the final layer), which proves necessary for stable multi-task gradient dynamics (see the stability finding below).
  • Channel factor $C = 2$, doubling filter counts relative to the base estimator, yielding ~8M parameters (vs. ~2M at $C = 1$). The larger estimator marginally improves all metrics.
  • Flow clipping to $[-128, 128]$, tighter than M-JEPA's $[-256, 256]$ range, preventing extreme flow values during multi-task training.

Critical stability finding: The ablation in Table 3 reveals that removing LayerNorm from the flow estimator causes training to crash when combined with the content objective. L2-normalization of features is even worse, dropping mIoU from 67.1 to 53.2. This highlights the delicate gradient dynamics of multi-task training: normalization choices that are optional in single-task settings become essential when motion and content losses compete for the same encoder capacity.

4.3 Expander Network (VICReg Branch)

The content learning branch processes the final-stage output of the shared encoder (768-dimensional) through an expander network — a three-layer fully-connected network that maps to a higher-dimensional embedding space where the VICReg loss is applied:

$$\text{Expander}: \mathbb{R}^{768} \xrightarrow{\text{FC}} \mathbb{R}^{8192} \xrightarrow{\text{FC}} \mathbb{R}^{8192} \xrightarrow{\text{FC}} \mathbb{R}^{8192}$$

The expander serves the same role as in standard VICReg: projecting to a high-dimensional space where the variance and covariance regularization terms can operate effectively without distorting the encoder's learned representation. The expander is discarded after training — only the encoder features are used for downstream tasks.
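A minimal construction of the expander is below. The paper specifies only the layer widths, so the interleaved BatchNorm and ReLU follow the standard VICReg expander design and are assumptions here.

```python
import torch
import torch.nn as nn

def make_expander(in_dim=768, hidden=8192, out_dim=8192):
    """768 → 8192 → 8192 → 8192 expander (E_psi in the pseudocode below).

    BatchNorm + ReLU between linear layers follows standard VICReg; the
    MC-JEPA paper does not detail the nonlinearity, so this is assumed.
    """
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.BatchNorm1d(hidden),
        nn.ReLU(inplace=True),
        nn.Linear(hidden, hidden),
        nn.BatchNorm1d(hidden),
        nn.ReLU(inplace=True),
        nn.Linear(hidden, out_dim),
    )
```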

4.4 Prediction Strategy — No Masking

Unlike I-JEPA, V-JEPA, and other masking-based JEPA variants, MC-JEPA does not mask any portion of its input. There are no target blocks, no context blocks, and no mask tokens. Instead, the architecture employs two distinct prediction mechanisms:

  1. Motion prediction: The flow estimator predicts how frame $t$'s features should be spatially transformed (warped) to align with frame $t+1$'s features. This is a prediction in representation space — the encoder must produce features where optical flow–based warping yields meaningful alignment. The prediction target is the actual feature map $X_{t+1}^{(l)}$ at each pyramid level.
  2. Content prediction: VICReg predicts that two augmented views of the same image should produce similar embeddings. The "prediction" is implicit: minimizing the invariance loss forces the encoder to be predictable across augmentations.

The connection to JEPA is most clear in the motion stream: the regression loss $\mathcal{L}_{reg} = \sum_{l=1}^{L} \|X_{t+1}^{(l)} - \hat{X}_{t+1}^{(l)}\|_2^2$ penalizes discrepancies between predicted and actual feature representations, which is precisely the JEPA paradigm of prediction in latent space rather than pixel space.

Figure 2: Coarse-to-fine optical flow estimation within MC-JEPA. Starting from zero-initialized flow at the coarsest pyramid level, each subsequent level warps source features, computes a 4D correlation volume with target features, and predicts a residual flow correction. The flow is progressively refined over 6 pyramid levels.

4.5 Loss Functions

MC-JEPA combines five motion losses with the VICReg self-supervised loss. We define each precisely.

4.5.1 Feature Regression Loss ($\mathcal{L}_{reg}$)

The regression loss enforces that warped source features match actual target features at every pyramid level:

$$\mathcal{L}_{reg} = \sum_{l=1}^{L} w_l^{reg} \| X_{t+1}^{(l)} - \text{warp}(X_t^{(l)}, f_{t,t+1}^{(l)}) \|_2^2$$

where $X_t^{(l)} \in \mathbb{R}^{C_l \times H_l \times W_l}$ are the encoder's features at pyramid level $l$, $f_{t,t+1}^{(l)}$ is the estimated flow at level $l$, $\text{warp}(\cdot)$ applies bilinear warping, and $w_l^{reg}$ is the per-level weight. This is the JEPA-aligned loss: it penalizes discrepancies in representation space, not pixel space. Per-level weights decay from $w_l^{reg} = 1.0$ at the coarsest levels to 0.01 at the finest level (Table 8 in the paper).
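The $\text{warp}(\cdot)$ operator appears in nearly every motion loss; it can be sketched with `grid_sample` as below. The exact sampling conventions (`align_corners`, border handling) are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn.functional as F

def bilinear_warp(x, flow):
    """Bilinearly sample x (B×C×H×W) at positions displaced by flow (B×2×H×W, in pixels)."""
    B, C, H, W = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=x.dtype, device=x.device),
        torch.arange(W, dtype=x.dtype, device=x.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # absolute x sampling coords, B×H×W
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # normalize to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * grid_x / max(W - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)   # B×H×W×2
    return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```

With zero flow this reduces to the identity, which makes it easy to sanity-check.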

4.5.2 Reconstruction Loss ($\mathcal{L}_{rec}$)

The reconstruction loss provides a pixel-level training signal by warping the entire source image using the estimated flow and comparing it to the actual target image:

$$\mathcal{L}_{rec} = d(I_{t+1}, \text{warp}(I_t, f_{t,t+1}))$$

where $d$ is a linear combination of L2 loss, L1 loss, and the structural similarity index (SSIM):

$$d(a, b) = \lambda_2 \|a - b\|_2^2 + \lambda_1 \|a - b\|_1 + \lambda_{\text{SSIM}} (1 - \text{SSIM}(a, b))$$

This pixel-level loss supplements the feature-level regression, providing direct photometric supervision for the flow field.
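A sketch of $d(\cdot,\cdot)$ with a simple box-filter SSIM; the paper does not report the mixing coefficients $\lambda_2, \lambda_1, \lambda_{\text{SSIM}}$, so equal weights are used here as placeholders.

```python
import torch
import torch.nn.functional as F

def ssim(a, b, window=3, C1=0.01 ** 2, C2=0.03 ** 2):
    """Mean structural similarity with average-pooling local statistics."""
    mu_a = F.avg_pool2d(a, window, 1, window // 2)
    mu_b = F.avg_pool2d(b, window, 1, window // 2)
    var_a = F.avg_pool2d(a * a, window, 1, window // 2) - mu_a ** 2
    var_b = F.avg_pool2d(b * b, window, 1, window // 2) - mu_b ** 2
    cov = F.avg_pool2d(a * b, window, 1, window // 2) - mu_a * mu_b
    num = (2 * mu_a * mu_b + C1) * (2 * cov + C2)
    den = (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2)
    return (num / den).clamp(0, 1).mean()

def photometric_loss(target, warped, l2=1.0, l1=1.0, lam_ssim=1.0):
    """d(a, b) = λ2·L2 + λ1·L1 + λ_SSIM·(1 − SSIM); coefficients assumed equal."""
    return (l2 * F.mse_loss(warped, target)
            + l1 * F.l1_loss(warped, target)
            + lam_ssim * (1.0 - ssim(warped, target)))
```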

4.5.3 Edge-Aware Smoothness Loss ($\mathcal{L}_{smooth}$)

Smoothness regularization penalizes flow gradients in regions where the image gradient is small (homogeneous areas where flow should be smooth):

$$\mathcal{L}_{smooth} = \sum_{d \in \{x,y\}} \sum_p \exp\!\left(-\lambda \, |\nabla_d I(p)|\right) \cdot |\nabla_d f_{t,t+1}(p)|$$

where $\nabla_d$ is the spatial gradient in direction $d$, $p$ indexes spatial locations, and $\lambda = 75.0$ controls the edge-awareness. At image edges (high $|\nabla_d I|$), the exponential weight approaches zero, allowing flow discontinuities; in smooth regions, the weight is near 1, enforcing smooth flow.
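The smoothness term translates directly into PyTorch with finite differences. This sketch assumes flow and image at the same resolution, whereas MC-JEPA predicts flow at 1/2 resolution, so an upsampling step would precede it in practice.

```python
import torch

def edge_aware_smoothness(flow, image, lam=75.0):
    """Σ_d Σ_p exp(−λ|∇_d I(p)|) · |∇_d f(p)|, averaged over pixels."""
    # image gradients, averaged over color channels
    dI_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dI_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    # flow gradients (both u and v components)
    df_dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs()
    df_dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs()
    wx = torch.exp(-lam * dI_dx)   # ≈ 0 at image edges, ≈ 1 in flat regions
    wy = torch.exp(-lam * dI_dy)
    return (wx * df_dx).mean() + (wy * df_dy).mean()
```

A spatially constant flow field incurs zero penalty, regardless of the image content.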

4.5.4 Cycle Consistency Loss ($\mathcal{L}_{cycle}$)

Cycle consistency enforces that warping features forward then backward returns to the original location:

$$\mathcal{L}_{cycle} = 0.2 \sum_{l=1}^{L} \| X_t^{(l)} - \text{warp}(\text{warp}(X_t^{(l)}, f_{t,t+1}^{(l)}), f_{t+1,t}^{(l)}) \|_2^2$$

where $f_{t,t+1}$ is the forward flow and $f_{t+1,t}$ is the backward flow (estimated symmetrically). The coefficient of 0.2 weights this relative to other losses. A forward-backward compatibility mask handles occlusion regions where cycle consistency cannot hold.
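The forward-backward compatibility mask is not specified in detail; a common formulation from the unsupervised-flow literature is sketched below. It takes the backward flow already resampled at the forward-displaced positions, and the thresholds $\alpha_1, \alpha_2$ are conventional values, not from the paper.

```python
import torch

def fb_consistency_mask(flow_fwd, flow_bwd_warped, alpha1=0.01, alpha2=0.5):
    """Boolean mask of non-occluded pixels via forward-backward consistency.

    flow_bwd_warped is the backward flow sampled at p + flow_fwd; at
    non-occluded pixels the two flows should roughly cancel.
    """
    def sq_mag(f):
        return (f ** 2).sum(dim=1)                     # B×H×W squared magnitude
    lhs = sq_mag(flow_fwd + flow_bwd_warped)
    rhs = alpha1 * (sq_mag(flow_fwd) + sq_mag(flow_bwd_warped)) + alpha2
    return (lhs < rhs).unsqueeze(1)                    # B×1×H×W boolean
```

The cycle loss would then be averaged only over pixels where this mask is true.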

4.5.5 Variance-Covariance Regularization ($\mathcal{L}_{vc}$)

This is the critical stabilization mechanism for multi-task training. Applied to pyramidal features at each level, it prevents feature collapse — the degenerate state where all spatial positions produce identical representations:

$$\mathcal{L}_{vc} = \sum_{l=1}^{L} \left[ \frac{\lambda_v^{(l)}}{d_l} \sum_{j=1}^{d_l} \max\!\left(0, \gamma - \sqrt{\text{Var}(X_{t,j}^{(l)}) + \epsilon}\right) + \frac{\lambda_c^{(l)}}{d_l} \sum_{i \neq j} [C(X_t^{(l)})]_{i,j}^2 \right]$$

where $d_l$ is the channel dimension at level $l$, $X_{t,j}^{(l)}$ is the $j$-th channel of the feature map, $\gamma$ is a target standard deviation threshold, $\epsilon$ is a small constant for numerical stability, $C(X_t^{(l)})$ is the covariance matrix of features, and $\lambda_v^{(l)}, \lambda_c^{(l)}$ are per-level weights. The variance term (hinge loss) ensures each channel maintains sufficient variation across the batch, while the covariance term decorrelates channels. Per-level weights $\lambda_v^{(l)}$ range from 0.01 (coarsest) to 0.0001 (finest), and $\lambda_c^{(l)}$ from 0.04 (coarsest) to 0.0 (finest).
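A sketch of the per-level regularizer follows. Treating every spatial position in the batch as a sample for the variance and covariance statistics is our assumption; the paper does not spell out the pooling.

```python
import torch
import torch.nn.functional as F

def vc_regularization(feats, lam_v=0.01, lam_c=0.04, gamma=1.0, eps=1e-4):
    """Variance-covariance regularization on one pyramid level (B×C×H×W).

    Each (batch, spatial) position is treated as a sample -- an assumption.
    """
    B, C, H, W = feats.shape
    x = feats.permute(0, 2, 3, 1).reshape(-1, C)   # (B·H·W) × C samples
    x = x - x.mean(dim=0)
    std = torch.sqrt(x.var(dim=0) + eps)
    var_loss = F.relu(gamma - std).mean()          # hinge on per-channel std
    cov = (x.T @ x) / (x.shape[0] - 1)             # C × C covariance matrix
    off_diag = cov.pow(2).sum() - cov.diagonal().pow(2).sum()
    cov_loss = off_diag / C
    return lam_v * var_loss + lam_c * cov_loss
```

Collapsed features (identical at every position) incur the full variance hinge, while well-spread, decorrelated features incur almost none.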

Why $\mathcal{L}_{vc}$ is essential: Without variance-covariance regularization, MC-JEPA's image segmentation mIoU collapses from 67.1 to 47.3 and video segmentation (J&F)m drops from 70.5 to 37.8 (Table 11). The motion objective alone does not provide sufficient gradient diversity to prevent the encoder's intermediate features from collapsing — it primarily constrains the spatial structure of features, not their distributional properties.

4.5.6 VICReg Self-Supervised Loss ($\mathcal{L}_{ssl}$)

Applied to the 8192-dimensional outputs $Z_1, Z_2$ of the expander network for two augmented views:

$$\mathcal{L}_{ssl} = \lambda_{inv} \mathcal{L}_{inv} + \lambda_{var} \mathcal{L}_{var} + \lambda_{cov} \mathcal{L}_{cov}$$

where:

  • Invariance: $\mathcal{L}_{inv} = \frac{1}{B} \sum_{i=1}^{B} \|Z_1^{(i)} - Z_2^{(i)}\|_2^2$ — embeddings of the same image should be identical
  • Variance: $\mathcal{L}_{var} = \frac{1}{d} \sum_{j=1}^{d} \max(0, \gamma - \sqrt{\text{Var}(Z_j) + \epsilon})$ — each embedding dimension must maintain variance across the batch
  • Covariance: $\mathcal{L}_{cov} = \frac{1}{d} \sum_{i \neq j} [C(Z)]_{i,j}^2$ — off-diagonal covariance entries are penalized to decorrelate dimensions

Coefficients: $\lambda_{inv} = 1.0$, $\lambda_{var} = 25.0$, $\lambda_{cov} = 1.0$.
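The three terms translate directly into PyTorch; the signatures below match the calls in the pseudocode of Section 6.1, and $\gamma = 1$, $\epsilon = 10^{-4}$ are assumed standard VICReg values.

```python
import torch
import torch.nn.functional as F

def vicreg_variance_loss(z1, z2, gamma=1.0, eps=1e-4):
    """Hinge loss keeping each embedding dimension's std above gamma."""
    def v(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(gamma - std).mean()
    return 0.5 * (v(z1) + v(z2))

def vicreg_covariance_loss(z1, z2):
    """Penalize squared off-diagonal entries of the embedding covariance."""
    def c(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (z.shape[0] - 1)
        d = z.shape[1]
        return (cov.pow(2).sum() - cov.diagonal().pow(2).sum()) / d
    return c(z1) + c(z2)
```

The invariance term is a plain MSE between $Z_1$ and $Z_2$, as in line 15 of Algorithm 1.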

4.5.7 Combined MC-JEPA Loss

The total training objective combines all motion losses (scaled by $\alpha = 0.1$) with the content loss:

$$\mathcal{L}_{MC\text{-}JEPA} = \alpha \left(\mathcal{L}_{reg} + \mathcal{L}_{rec} + \mathcal{L}_{smooth} + \mathcal{L}_{cycle} + \mathcal{L}_{vc}\right) + \mathcal{L}_{ssl}$$

The multi-task coefficient $\alpha = 0.1$ is critical: the paper's ablation shows that higher values significantly degrade content learning performance.

4.6 Variance-Covariance Regularization — A Closer Look

The $\mathcal{L}_{vc}$ term deserves special attention because it is the component that makes multi-task training viable. The authors apply it per pyramid level with carefully tuned weights (Table 8), with a 1-epoch warmup period during which only flow losses are active before $\mathcal{L}_{vc}$ kicks in.

The ablation in Table 11 reveals the design space:

  • No regularization: ISeg drops to 47.3 mIoU, VSeg to 37.8 (J&F)m — catastrophic
  • Last layer only: ISeg recovers to 65.6, VSeg to 69.2 — helpful but suboptimal
  • All layers, no warmup: ISeg 66.2, VSeg 69.4 — slightly worse due to interference with flow initialization
  • All layers, 1-epoch warmup: ISeg 67.1, VSeg 70.5 — optimal
  • All layers, 2-epoch warmup: ISeg 62.5, VSeg 64.1 — too much delay degrades early training dynamics

5. Implementation Details

| Hyperparameter | Value | Notes |
| --- | --- | --- |
| Backbone | ConvNeXt-Tiny | ~23M params, modified stem for 6 pyramid levels |
| Stem architecture | 4×4 conv (stride 2, pad 1) → LN → 3×3 conv (stride 2, pad 1) → LN | Replaces the standard 4×4 stride-4 patchify stem; 3→48→48 channels |
| Stage channels | [96, 192, 384, 768] | Standard ConvNeXt-T dimensions |
| Stage blocks | [3, 3, 9, 3] | Standard ConvNeXt-T block counts |
| Pyramid levels | 6 | 2 from stem + 4 from stages |
| drop_path_rate | 0.1 | Stochastic depth for regularization |
| layer_scale_init_value | 0.0 | Initialized to zero |
| Flow estimator filter factor | C = 2 | Doubles channel count → ~8M params |
| Flow estimator LayerNorm | After every conv layer except final | Critical for multi-task stability |
| Flow clip range | [-128, 128] | Tighter than M-JEPA's [-256, 256] |
| Expander network | 768 → 8192 → 8192 → 8192 | 3-layer FC, applied to final encoder output |

Optimizer & Schedule

| Hyperparameter | Value | Notes |
| --- | --- | --- |
| Optimizer | AdamW | $\beta_1 = 0.9$, $\beta_2 = 0.999$ |
| Encoder learning rate | $3 \times 10^{-4}$ | For the shared ConvNeXt-T backbone |
| Flow estimator learning rate | $1 \times 10^{-4}$ | Lower LR for flow head |
| End learning rate | $3 \times 10^{-8}$ | Cosine decay target |
| Weight decay | $1 \times 10^{-6}$ | Applied uniformly |
| LR schedule | Cosine decay | With linear warmup |
| Warmup epochs | 10 | ImageNet-only (VICReg) for first 10 epochs |
| Total epochs | 100 | Phase 1: epochs 0–9; Phase 2: epochs 10–99 |
| SSL batch size | 384 | ImageNet augmented pairs |
| Flow batch size | 8 | Video frame pairs |

Data & Augmentation

| Hyperparameter | Value | Notes |
| --- | --- | --- |
| Content data | ImageNet-1K | Random resized crop (scale 0.08–1.0) to 224×224, color jitter |
| Flow training data | Mixed: FlyingThings, FlyingChairs, KITTI raw/train, Sintel raw/train, HD1k | With dataset-specific repetition factors (Table 7) |
| Flow data resolution | Sintel: 384×832, KITTI: 256×832, FlyingThings: 384×512 | Variable resolution per dataset |

Loss Coefficients

| Hyperparameter | Value | Notes |
| --- | --- | --- |
| Multi-task coefficient ($\alpha$) | 0.1 | Scales all motion losses relative to VICReg |
| Cycle consistency coefficient | 0.2 | Weights $\mathcal{L}_{cycle}$ within motion losses |
| Smoothness factor ($\lambda$) | 75.0 | Edge-awareness exponential scaling |
| VC regularization warmup | 1 epoch | $\mathcal{L}_{vc}$ starts 1 epoch after flow training begins |
| VICReg $\lambda_{var}$ | 25.0 | Variance coefficient on expander output |
| VICReg $\lambda_{cov}$ | 1.0 | Covariance coefficient on expander output |
| VICReg $\lambda_{inv}$ | 1.0 | Invariance coefficient on expander output |

Hardware

| Hyperparameter | Value | Notes |
| --- | --- | --- |
| GPUs | 8× NVIDIA V100 (32 GB) | |

The flow training data uses carefully tuned repetition factors to balance the diverse datasets (Table 7 in the paper): small datasets like KITTI 2012 train (200 samples) and KITTI 2015 train (200 samples) are repeated 100× per epoch, while larger datasets like FlyingThings (40,302 samples) and KITTI raw (42,382 samples) are used at 1× repetition.
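One simple way to realize such repetition factors (a sketch; the paper does not describe its data-loading code) is to repeat each dataset before concatenation, so that uniform sampling over the result reproduces the desired mixture:

```python
from torch.utils.data import ConcatDataset

def build_flow_dataset(datasets_with_reps):
    """Concatenate flow datasets with per-dataset repetition factors.

    datasets_with_reps: iterable of (dataset, reps) pairs, e.g. a 200-sample
    KITTI train split with reps=100 and FlyingThings with reps=1 (Table 7).
    """
    parts = []
    for ds, reps in datasets_with_reps:
        parts.extend([ds] * reps)   # dataset appears reps times per epoch
    return ConcatDataset(parts)
```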

No public implementation of MC-JEPA is available. All hyperparameters are sourced from the paper. The related V-JEPA repository (github.com/facebookresearch/jepa) does not include MC-JEPA code.

6. Algorithm

Algorithm 1: MC-JEPA Training (One Iteration)
Input: Video frame pair $(I_t, I_{t+1})$ from dataset $\mathcal{D}_1$; ImageNet image $I$ from dataset $\mathcal{D}_2$
Input: Shared encoder $f_\theta$, flow estimator $F_\phi$, expander $E_\psi$
Input: Multi-task coefficient $\alpha = 0.1$, number of pyramid levels $L = 6$
Output: Updated parameters $\theta, \phi, \psi$
 
1 // Motion stream
2 $\{X_t^{(l)}\}_{l=1}^{L}, \{X_{t+1}^{(l)}\}_{l=1}^{L} \leftarrow f_\theta(I_t), f_\theta(I_{t+1})$ // encode both frames
3 $f_{t,t+1} \leftarrow F_\phi(\{X_t^{(l)}\}, \{X_{t+1}^{(l)}\})$ // coarse-to-fine forward flow
4 $f_{t+1,t} \leftarrow F_\phi(\{X_{t+1}^{(l)}\}, \{X_t^{(l)}\})$ // coarse-to-fine backward flow
5 Compute $\mathcal{L}_{reg} = \sum_{l=1}^{L} w_l \| X_{t+1}^{(l)} - \text{warp}(X_t^{(l)}, f_{t,t+1}^{(l)}) \|_2^2$
6 Compute $\mathcal{L}_{rec} = d(I_{t+1}, \text{warp}(I_t, f_{t,t+1}))$
7 Compute $\mathcal{L}_{smooth} = \sum_{d,p} \exp(-\lambda|\nabla_d I|) \cdot |\nabla_d f_{t,t+1}|$
8 Compute $\mathcal{L}_{cycle} = 0.2 \sum_l \| X_t^{(l)} - \text{warp}(\text{warp}(X_t^{(l)}, f_{t,t+1}^{(l)}), f_{t+1,t}^{(l)}) \|_2^2$
9 Compute $\mathcal{L}_{vc}$ per-level variance-covariance regularization
10 $\mathcal{L}_{motion} \leftarrow \mathcal{L}_{reg} + \mathcal{L}_{rec} + \mathcal{L}_{smooth} + \mathcal{L}_{cycle} + \mathcal{L}_{vc}$
 
11 // Content stream
12 $\tilde{I}_1, \tilde{I}_2 \leftarrow \text{aug}(I), \text{aug}'(I)$ // two augmented views
13 $h_1, h_2 \leftarrow f_\theta(\tilde{I}_1), f_\theta(\tilde{I}_2)$ // encode with shared encoder
14 $Z_1, Z_2 \leftarrow E_\psi(h_1), E_\psi(h_2)$ // project to 8192-dim
15 Compute $\mathcal{L}_{ssl} = \lambda_{inv}\mathcal{L}_{inv}(Z_1, Z_2) + \lambda_{var}\mathcal{L}_{var}(Z_1, Z_2) + \lambda_{cov}\mathcal{L}_{cov}(Z_1, Z_2)$
 
16 // Combined loss and update
17 $\mathcal{L} \leftarrow \alpha \cdot \mathcal{L}_{motion} + \mathcal{L}_{ssl}$
18 $\theta, \phi, \psi \leftarrow \text{AdamW}(\nabla_{\theta,\phi,\psi} \mathcal{L})$
Algorithm 2: Coarse-to-Fine Flow Estimation
Input: Source features $\{X_t^{(l)}\}_{l=1}^L$, target features $\{X_{t+1}^{(l)}\}_{l=1}^L$, pyramid levels $L=6$
Output: Multi-scale flow $\{f_{t,t+1}^{(l)}\}_{l=2}^{L}$
 
1 $f_{t,t+1}^{(1)} \leftarrow \mathbf{0}$ // initialize flow at coarsest level to zero
2 for $l = 1$ to $L - 1$ do
3 $f^{\uparrow} \leftarrow \text{upsample}(f_{t,t+1}^{(l)})$ // bilinear upsample to level $l+1$ resolution
4 $\hat{X}_{t+1}^{(l+1)} \leftarrow \text{warp}(X_t^{(l+1)}, f^{\uparrow})$ // warp source features
5 $V^{(l+1)} \leftarrow \hat{X}_{t+1}^{(l+1)} \cdot {X_{t+1}^{(l+1)}}^T$ // 4D correlation volume
6 $\Delta f^{(l+1)} \leftarrow g_\phi(V^{(l+1)}, f^{\uparrow})$ // predict residual flow (CNN + LayerNorm)
7 $f_{t,t+1}^{(l+1)} \leftarrow \text{clip}(f^{\uparrow} + \Delta f^{(l+1)}, -128, 128)$ // update and clip
8 end for
9 return $\{f_{t,t+1}^{(l)}\}_{l=2}^{L}$
Algorithm 3: MC-JEPA Full Training Loop
Input: ImageNet dataset $\mathcal{D}_2$, flow datasets $\mathcal{D}_1$, total epochs $T=100$, warmup $T_w=10$
Input: VC warmup $T_{vc}=1$, multi-task coefficient $\alpha=0.1$
 
1 Initialize $f_\theta$ (ConvNeXt-T), $F_\phi$ (flow estimator), $E_\psi$ (expander)
2 for $epoch = 0$ to $T - 1$ do
3 Update learning rate via cosine schedule with warmup
4 for each batch do
5 // Phase 1: content only (epochs 0 to $T_w - 1$)
6 Sample ImageNet batch $\{I\}$ of size 384
7 Compute $\mathcal{L}_{ssl}$ (VICReg on augmented views)
8 if $epoch \geq T_w$ then
9 // Phase 2: joint motion + content
10 Sample video pair $(I_t, I_{t+1})$ from $\mathcal{D}_1$ of size 8
11 Compute $\mathcal{L}_{motion}$ via Algorithm 1, lines 2–10
12 if $epoch < T_w + T_{vc}$ then disable $\mathcal{L}_{vc}$ in $\mathcal{L}_{motion}$
13 $\mathcal{L} \leftarrow \alpha \cdot \mathcal{L}_{motion} + \mathcal{L}_{ssl}$
14 else
15 $\mathcal{L} \leftarrow \mathcal{L}_{ssl}$
16 end if
17 $\theta, \phi, \psi \leftarrow \text{AdamW}(\nabla \mathcal{L})$ // backprop through shared encoder
18 end for
19 end for

6.1 Reference Implementation (Python-like Pseudocode)

The following provides a self-contained reference for the core MC-JEPA forward pass and loss computation. No official repository is available; this pseudocode is reconstructed from the paper's architectural description and hyperparameters.

import torch
import torch.nn.functional as F

def coarse_to_fine_flow(encoder_feats_src, encoder_feats_tgt, flow_estimator, L=6, clip=128.0):
    """Algorithm 2: Coarse-to-fine optical flow estimation.

    Args:
        encoder_feats_src: list of L feature maps {X_t^(l)}, coarsest first
        encoder_feats_tgt: list of L feature maps {X_{t+1}^(l)}, coarsest first
        flow_estimator: g_phi CNN that predicts residual flow from correlation volume
        L: number of pyramid levels (6 for MC-JEPA)
        clip: flow clipping range (128.0 for MC-JEPA, 256.0 for M-JEPA)

    Returns:
        flows: list of per-level flow fields {f^(l)}_{l=2}^{L}
    """
    B, C0, H0, W0 = encoder_feats_src[0].shape
    flow = torch.zeros(B, 2, H0, W0, device=encoder_feats_src[0].device)  # f^(1) = 0
    flows = []

    for l in range(L - 1):
        # Upsample flow to next (finer) level
        H_next, W_next = encoder_feats_src[l + 1].shape[2:]
        flow_up = F.interpolate(flow, size=(H_next, W_next), mode='bilinear', align_corners=False)
        flow_up[:, 0] *= W_next / flow.shape[3]  # scale x-component
        flow_up[:, 1] *= H_next / flow.shape[2]  # scale y-component

        # Warp source features using current flow estimate
        X_src = encoder_feats_src[l + 1]                          # X_t^(l+1): B×C×H×W
        X_tgt = encoder_feats_tgt[l + 1]                          # X_{t+1}^(l+1): B×C×H×W
        X_warped = bilinear_warp(X_src, flow_up)                   # hat{X}_{t+1}^(l+1)

        # 4D correlation volume: V = hat{X}_{t+1} · X_{t+1}^T
        V = compute_correlation_volume(X_warped, X_tgt)            # B×(search_range^2)×H×W

        # Predict residual flow via CNN with LayerNorm
        delta_flow = flow_estimator[l](V, flow_up, X_src, X_warped)  # g_phi → B×2×H×W

        # Update flow with residual and clip
        flow = torch.clamp(flow_up + delta_flow, -clip, clip)     # f^(l+1) = clip(f_up + Δf)
        flows.append(flow)

    return flows  # flows[0] is f^(2), ..., flows[-1] is f^(L)


def mc_jepa_training_step(
    encoder,          # f_theta: shared ConvNeXt-T, 23M params
    flow_estimator,   # F_phi: PWC-Net based, 8M params
    expander,         # E_psi: 768→8192→8192→8192
    video_pair,       # (I_t, I_{t+1}): B_flow × 3 × H × W
    imagenet_views,   # (view1, view2): B_ssl × 3 × 224 × 224
    alpha=0.1,        # multi-task coefficient
    epoch=50,         # current epoch
    flow_start=10,    # epoch to begin flow training
    vc_warmup=1,      # epochs of VC warmup after flow_start
):
    """Algorithm 1+3: MC-JEPA combined training step."""
    loss = torch.tensor(0.0)

    # ── Content stream (always active) ──
    view1, view2 = imagenet_views
    h1 = encoder(view1)[0].mean(dim=[2, 3])  # deepest level (768 ch), global avg pool → B×768
    h2 = encoder(view2)[0].mean(dim=[2, 3])
    z1, z2 = expander(h1), expander(h2)       # B×8192

    L_inv = F.mse_loss(z1, z2)                                    # invariance: ||Z1 - Z2||^2
    L_var = vicreg_variance_loss(z1, z2, gamma=1.0, eps=1e-4)     # hinge on std
    L_cov = vicreg_covariance_loss(z1, z2)                        # off-diagonal penalty
    L_ssl = 1.0 * L_inv + 25.0 * L_var + 1.0 * L_cov
    loss = loss + L_ssl

    # ── Motion stream (active after flow_start epoch) ──
    if epoch >= flow_start:
        I_t, I_t1 = video_pair
        feats_t  = encoder(I_t)    # list of 6 feature maps, coarsest to finest
        feats_t1 = encoder(I_t1)

        # Forward and backward flow (Algorithm 2)
        flows_fwd = coarse_to_fine_flow(feats_t, feats_t1, flow_estimator)
        flows_bwd = coarse_to_fine_flow(feats_t1, feats_t, flow_estimator)

        # Per-level feature regression: L_reg = Σ_l w_l ||X_{t+1} - warp(X_t, f)||^2
        w_reg = [1.0, 1.0, 1.0, 1.0, 0.1, 0.01]  # per-level weights
        L_reg = sum(w * F.mse_loss(feats_t1[l+1], bilinear_warp(feats_t[l+1], f))
                     for l, (f, w) in enumerate(zip(flows_fwd, w_reg)))

        # Pixel reconstruction: L_rec = d(I_{t+1}, warp(I_t, f_final))
        I_warped = bilinear_warp(I_t, flows_fwd[-1])
        L_rec = photometric_loss(I_t1, I_warped)  # L2 + L1 + SSIM

        # Edge-aware smoothness: exp(-λ|∇I|) · |∇f|
        L_smooth = edge_aware_smoothness(flows_fwd[-1], I_t, lam=75.0)

        # Cycle consistency: ||X_t - warp(warp(X_t, f_fwd), f_bwd)||^2
        L_cycle = 0.2 * sum(
            F.mse_loss(feats_t[l+1],
                       bilinear_warp(bilinear_warp(feats_t[l+1], ff), fb))
            for l, (ff, fb) in enumerate(zip(flows_fwd, flows_bwd))
        )

        L_motion = L_reg + L_rec + L_smooth + L_cycle

        # Variance-covariance regularization (after VC warmup)
        if epoch >= flow_start + vc_warmup:
            L_vc = sum(
                vc_regularization(feats_t[l], lam_v=lv, lam_c=lc)
                for l, (lv, lc) in enumerate(
                    zip([0.01, 0.01, 0.01, 0.01, 0.001, 0.0001],
                        [0.04, 0.04, 0.001, 0.0, 0.0, 0.0]))
            )
            L_motion = L_motion + L_vc

        loss = loss + alpha * L_motion

    return loss

7. Training

Training MC-JEPA proceeds in two phases with distinct objectives.

7.1 Phase 1: Content Pre-training (Epochs 0–9)

During the first 10 epochs, only the VICReg content objective is active. The shared ConvNeXt-T encoder and expander network are trained on ImageNet-1K:

  1. Sample a batch of 384 ImageNet images
  2. Generate two augmented views of each image: random resized crop (scale 0.08–1.0, output 224×224), horizontal flip, color jitter
  3. Encode both views through the shared ConvNeXt-T to get 768-dim representations
  4. Pass through the 3-layer expander to get 8192-dim embeddings $Z_1, Z_2$
  5. Compute VICReg loss $\mathcal{L}_{ssl}$ and update $\theta$ (encoder) and $\psi$ (expander)
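The helpers `vicreg_variance_loss` and `vicreg_covariance_loss` used in the training-step pseudocode are never spelled out. A minimal sketch following the VICReg paper's definitions (the paper's exact implementation may differ in detail):

```python
import torch
import torch.nn.functional as F

def vicreg_variance_loss(z1, z2, gamma=1.0, eps=1e-4):
    """Hinge loss keeping the std of each embedding dimension above gamma."""
    std1 = torch.sqrt(z1.var(dim=0) + eps)
    std2 = torch.sqrt(z2.var(dim=0) + eps)
    return 0.5 * (F.relu(gamma - std1).mean() + F.relu(gamma - std2).mean())

def vicreg_covariance_loss(z1, z2):
    """Penalize squared off-diagonal covariance so dimensions decorrelate."""
    def off_diag_sq(z):
        n, d = z.shape
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off = cov - torch.diag(torch.diag(cov))
        return off.pow(2).sum() / d
    return off_diag_sq(z1) + off_diag_sq(z2)
```

A fully collapsed batch (all embeddings identical) drives the variance hinge up toward `gamma` while the covariance term vanishes, which is exactly the failure mode the regularizer is designed to penalize.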

This phase establishes a semantically meaningful feature space before introducing the motion objective, preventing the flow gradients from dominating the early, fragile learning of content features.

7.2 Phase 2: Joint Motion-Content Training (Epochs 10–100)

Starting at epoch 10, both streams operate simultaneously at each training iteration:

  1. Content step: Sample an ImageNet batch, compute $\mathcal{L}_{ssl}$ as above
  2. Motion step: Sample a video pair batch (size 8) from the mixed flow dataset
    • Encode both frames through the shared encoder → 6-level pyramidal features
    • Run coarse-to-fine flow estimation (Algorithm 2) for forward and backward flows
    • Compute $\mathcal{L}_{reg}$, $\mathcal{L}_{rec}$, $\mathcal{L}_{smooth}$, $\mathcal{L}_{cycle}$
    • From epoch 11 onward: also compute $\mathcal{L}_{vc}$ (variance-covariance regularization)
  3. Combine: $\mathcal{L} = \alpha \cdot \mathcal{L}_{motion} + \mathcal{L}_{ssl}$ with $\alpha = 0.1$
  4. Update: Single AdamW step updates the shared encoder $\theta$, flow estimator $\phi$, and expander $\psi$ jointly

The paper's ablation (Table 5) shows that batch alternation (alternating the two objectives between consecutive batches) and combined loss (summing both losses at each iteration before backprop) both work well, while epoch alternation (switching the active objective only once per epoch) is significantly worse. The combined loss approach yields the best optical flow metrics.

[Figure 3 diagram — one Phase-2 training iteration: a video pair $(I_t, I_{t+1})$ (Sintel/KITTI/FT, batch 8) and an augmented ImageNet batch (224×224, batch 384) both pass through the shared ConvNeXt-T encoder $f_\theta$ (23M params, 6 pyramid levels, 96–768 channels). The motion path runs the PWC-based flow estimator $F_\phi$ (8M params) to produce the flow $f_{t,t+1}$ (H/2×W/2×2) and computes $\mathcal{L}_{motion}$ ($\mathcal{L}_{reg}$ feature warp L2, $\mathcal{L}_{rec}$ pixel reconstruction, $\mathcal{L}_{smooth}$, $\mathcal{L}_{cycle}$, $\mathcal{L}_{vc}$), scaled by $\alpha = 0.1$. The content path pools 768-d features $h_1, h_2$, expands them to 8192-d embeddings $Z_1, Z_2$, and computes the VICReg loss (inv=1.0, var=25.0, cov=1.0). The total loss $0.1\,\mathcal{L}_{motion} + \mathcal{L}_{ssl}$ backpropagates to $\theta, \phi, \psi$. Schedule: Phase 1 (epochs 0–9), Phase 2 joint training (epochs 10–100), VC regularization from epoch 11.]
Figure 3: Detailed training pipeline for MC-JEPA Phase 2 (epochs 10–100). At each iteration, a video batch feeds the motion stream and an ImageNet batch feeds the content stream. Both paths share the ConvNeXt-T encoder. The total loss, combining motion ($\times 0.1$) and VICReg losses, is backpropagated through all trainable parameters in a single optimizer step.

7.3 Training Objective — Mathematical Formulation

The complete training objective at epoch $e \geq T_w + T_{vc}$, where $T_w = 10$ is the flow warmup and $T_{vc} = 1$ the variance-covariance warmup (i.e., after all warmup phases), is:

$$\mathcal{L} = \alpha \underbrace{\left[\sum_{l=1}^{L} w_l^{reg} \|X_{t+1}^{(l)} - \hat{X}_{t+1}^{(l)}\|_2^2 + d(I_{t+1}, \hat{I}_{t+1}) + \mathcal{L}_{smooth} + 0.2 \sum_l \|X_t^{(l)} - \hat{X}_{t}^{(l, \text{cycle})}\|_2^2 + \mathcal{L}_{vc}\right]}_{\text{motion losses over } \mathcal{D}_1}$$ $$ + \underbrace{\lambda_{inv}\|Z_1 - Z_2\|_2^2 + \lambda_{var}\mathcal{L}_{var}(Z_1, Z_2) + \lambda_{cov}\mathcal{L}_{cov}(Z_1, Z_2)}_{\text{VICReg loss over } \mathcal{D}_2}$$

where $\hat{X}_{t+1}^{(l)} = \text{warp}(X_t^{(l)}, f_{t,t+1}^{(l)})$, $\hat{I}_{t+1} = \text{warp}(I_t, f_{t,t+1})$, and $\hat{X}_t^{(l, \text{cycle})} = \text{warp}(\hat{X}_{t+1}^{(l)}, f_{t+1,t}^{(l)})$.
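The warp operator used throughout is never defined in the pseudocode. A minimal sketch via `torch.nn.functional.grid_sample`, assuming the flow is stored in pixel units with channel 0 the x-displacement and channel 1 the y-displacement (the paper's exact implementation may differ):

```python
import torch
import torch.nn.functional as F

def bilinear_warp(x, flow):
    """Backward bilinear warp: out[b, :, i, j] = x sampled at (j + flow_x, i + flow_y).

    x:    B×C×H×W feature map (or image)
    flow: B×2×H×W displacement field in pixel units (channel 0 = x, 1 = y)
    """
    B, _, H, W = flow.shape
    # Base sampling grid of integer pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(H, dtype=x.dtype, device=x.device),
                            torch.arange(W, dtype=x.dtype, device=x.device),
                            indexing='ij')
    grid_x = xs.unsqueeze(0) + flow[:, 0]   # B×H×W
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize coordinates to [-1, 1] as grid_sample requires
    grid_x = 2.0 * grid_x / max(W - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(H - 1, 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)  # B×H×W×2, last dim = (x, y)
    return F.grid_sample(x, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)
```

With zero flow this reduces to the identity; `padding_mode='border'` clamps out-of-frame samples, a common choice in PWC-style pipelines for pixels that flow out of the image.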

8. Inference

After training, the shared ConvNeXt-T encoder is used for downstream tasks. The flow estimator and expander network are discarded (unless optical flow estimation is the downstream task).

8.1 Optical Flow Estimation

For optical flow prediction on new video:

  1. Feed two consecutive frames $I_t, I_{t+1}$ through the trained ConvNeXt-T encoder to get 6-level pyramidal features
  2. Run the coarse-to-fine flow estimator (Algorithm 2) to produce the dense flow field $f_{t,t+1}$
  3. Upsample the final flow (from H/2 × W/2) to full resolution H × W

8.2 Image Segmentation

For image understanding tasks (Pascal VOC, Cityscapes, ADE20k):

  1. Feed the image through the ConvNeXt-T encoder
  2. Extract the multi-scale pyramidal features
  3. Attach a segmentation head (task-specific, trained from scratch)
  4. Two evaluation modes: frozen (encoder weights fixed, only head trains) and fine-tuned (entire model updates)

8.3 Inference Code (Python-like Pseudocode)

def mc_jepa_flow_inference(encoder, flow_estimator, I_t, I_t1):
    """Optical flow inference: encoder + flow estimator, no expander."""
    with torch.no_grad():
        feats_t  = encoder(I_t)    # 6-level pyramid
        feats_t1 = encoder(I_t1)
        flows = coarse_to_fine_flow(feats_t, feats_t1, flow_estimator)
        # Upsample final flow from H/2×W/2 to full resolution
        flow_final = F.interpolate(flows[-1], scale_factor=2, mode='bilinear')
        flow_final *= 2.0  # scale displacement magnitudes
    return flow_final  # B×2×H×W full-resolution flow


def mc_jepa_feature_extraction(encoder, image):
    """Feature extraction for segmentation: encoder only, discard flow head."""
    with torch.no_grad():
        pyramid = encoder(image)       # list of 6 feature maps
        global_feat = pyramid[0]       # deepest semantic level (coarsest spatial): B×768×H/32×W/32
        local_feats = pyramid[2:5]     # mid-level features for dense prediction
    return global_feat, local_feats

8.4 Video Object Segmentation

For DAVIS 2017 video segmentation, the trained encoder features are used for label propagation — propagating object masks from the first frame to subsequent frames based on feature similarity.
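Label propagation here follows the standard protocol (as popularized by DINO's DAVIS evaluation): each pixel in a new frame takes a similarity-weighted vote over its top-k nearest neighbors in feature space among reference pixels. A minimal single-reference-frame sketch; `propagate_labels` and its parameters are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, labels_ref, feat_tgt, k=5, temp=0.1):
    """Propagate per-pixel object masks from a reference frame to a target frame.

    feat_ref, feat_tgt: C×H×W encoder features for the two frames
    labels_ref:         N×H×W one-hot object masks for the reference frame
    returns:            N×H×W soft masks for the target frame
    """
    C, H, W = feat_ref.shape
    fr = F.normalize(feat_ref.reshape(C, -1), dim=0)   # C×(HW), unit vectors
    ft = F.normalize(feat_tgt.reshape(C, -1), dim=0)
    sim = ft.T @ fr                                    # (HW_tgt)×(HW_ref) cosine sims
    topk, idx = sim.topk(k, dim=1)                     # k nearest reference pixels
    w = F.softmax(topk / temp, dim=1)                  # similarity-weighted vote
    lab = labels_ref.reshape(labels_ref.shape[0], -1)  # N×(HW_ref)
    out = (lab[:, idx] * w.unsqueeze(0)).sum(dim=2)    # N×(HW_tgt)
    return out.reshape(-1, H, W)
```

In the full DAVIS protocol the reference set usually contains the first frame plus a queue of recent frames, and propagation is restricted to a local spatial window; both are omitted here for brevity.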

[Figure 4 diagram — three inference pipelines: (A) optical flow: frame pair $(I_t, I_{t+1})$ → frozen ConvNeXt-T encoder (6 levels) → coarse-to-fine flow estimator → flow $f_{t,t+1}$ (H×W×2), evaluated by EPE on Sintel/KITTI; (B) image segmentation: image → ConvNeXt-T (frozen or fine-tuned) → task-specific segmentation head, evaluated by mIoU on VOC/Cityscapes/ADE20k; (C) video object segmentation: video frames → frozen encoder → label propagation via feature similarity, evaluated by (J&F)m on DAVIS 2017.]
Figure 4: MC-JEPA inference pipelines. (A) Optical flow estimation retains both the encoder and flow estimator. (B) Image segmentation uses only the encoder with a task-specific head. (C) Video object segmentation uses frozen encoder features for label propagation.

9. Results & Benchmarks

9.1 Optical Flow Estimation

Optical flow is evaluated using End-Point Error (EPE) — the L2 distance between predicted and ground-truth flow vectors, averaged over all pixels. Lower is better.
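The metric as defined above is straightforward to compute; a short sketch, where the optional `valid` mask handles KITTI's sparse ground truth:

```python
import torch

def end_point_error(flow_pred, flow_gt, valid=None):
    """Average End-Point Error: mean L2 distance between flow vectors.

    flow_pred, flow_gt: B×2×H×W flow fields in pixel units
    valid: optional B×H×W {0,1} mask of pixels that have ground truth
    """
    epe = torch.norm(flow_pred - flow_gt, p=2, dim=1)  # B×H×W per-pixel error
    if valid is not None:
        return (epe * valid).sum() / valid.sum().clamp(min=1)
    return epe.mean()
```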

| Method | Backbone | Sintel Clean (train) | Sintel Final (train) | KITTI 2015 (train) |
| --- | --- | --- | --- | --- |
| UFlow | PWC-Net | 2.50 | 3.39 | 2.71 |
| ARFlow | PWC-Net | 2.79 | 3.73 | 2.85 |
| UPFlow | PWC-Net | 2.33 | 2.67 | 2.45 |
| SMURF | RAFT | 1.71 | 2.58 | 2.00 |
| M-JEPA (motion only) | ConvNeXt-T | 2.98 | 3.82 | 3.01 |
| MC-JEPA | ConvNeXt-T | 2.81 | 3.51 | 2.67 |

MC-JEPA improves over the motion-only baseline (M-JEPA) across all benchmarks, confirming that content learning benefits optical flow estimation. EPE drops from 2.98 to 2.81 on Sintel Clean (5.7% improvement), from 3.82 to 3.51 on Sintel Final (8.1% improvement), and from 3.01 to 2.67 on KITTI 2015 (11.3% improvement). MC-JEPA is competitive with PWC-based unsupervised methods (UFlow, ARFlow, UPFlow) but does not match SMURF, which uses the more powerful RAFT backbone.

9.2 Image Segmentation

Semantic segmentation is evaluated using mean Intersection over Union (mIoU) in both frozen and fine-tuned settings.

| Method | Backbone | Pascal VOC (frz / ft) | Cityscapes (frz / ft) | ADE20k (frz / ft) |
| --- | --- | --- | --- | --- |
| MoCo v3 | ViT-S | 57.1 / 75.9 | 56.5 / 74.0 | 23.7 / 39.8 |
| VICReg | ConvNeXt-T | 60.1 / 77.8 | 59.8 / 76.3 | 28.6 / 41.1 |
| VICRegL | ConvNeXt-T | 66.8 / 79.7 | 64.9 / 78.3 | 30.6 / 44.1 |
| DINO | ViT-S | 65.2 / 79.5 | 64.8 / 78.1 | 30.5 / 43.5 |
| MC-JEPA | ConvNeXt-T | 67.1 / 79.9 | 65.5 / 78.4 | 30.8 / 44.2 |

MC-JEPA achieves the best frozen-feature performance across all three benchmarks, outperforming VICRegL (its content-learning component without motion) by +0.3 mIoU on Pascal VOC, +0.6 on Cityscapes, and +0.2 on ADE20k in the frozen setting. It also surpasses DINO by +1.9 on Pascal VOC, +0.7 on Cityscapes, and +0.3 on ADE20k. Fine-tuned results are competitive but gains are smaller, as expected when all weights are updated.

9.3 Video Object Segmentation

| Method | (J&F)m | Jm | Fm |
| --- | --- | --- | --- |
| MCRW | 57.9 | — | — |
| VICReg | 58.1 | 56.4 | 59.8 |
| VFS | 68.9 | — | — |
| VICRegL | 66.7 | 64.5 | 68.9 |
| DINO | 69.9 | 66.6 | 73.1 |
| MC-JEPA | 70.5 | 67.0 | 74.0 |

The DAVIS 2017 video object segmentation results are particularly striking. MC-JEPA achieves 70.5 (J&F)m, surpassing DINO's 69.9 by 0.6 points. This is the strongest validation of the multi-task hypothesis: motion-aware features produce better temporal correspondence for video understanding, exactly as one would expect when the encoder has been trained to reason about how objects move.

9.4 Ablation Studies

Multi-Task Benefit: M-JEPA vs MC-JEPA

| Method | KITTI EPE ↓ | Sintel Clean EPE ↓ | Pascal VOC mIoU ↑ | DAVIS (J&F)m ↑ |
| --- | --- | --- | --- | --- |
| M-JEPA (motion only) | 3.01 | 2.98 | — | — |
| VICReg (content only) | — | — | 60.1 | 58.1 |
| VICRegL (content + local) | — | — | 66.8 | 66.7 |
| MC-JEPA (both) | 2.67 | 2.81 | 67.1 | 70.5 |

Joint training improves both objectives: optical flow improves (3.01 → 2.67 KITTI EPE) and content features improve (66.8 → 67.1 VOC mIoU). The largest gains appear in video segmentation (66.7 → 70.5 DAVIS), where motion understanding directly aids temporal reasoning.

Data Sampling Strategy (Table 5)

| Strategy | KITTI EPE ↓ | Sintel Clean EPE ↓ | ISeg mIoU ↑ | VSeg (J&F)m ↑ |
| --- | --- | --- | --- | --- |
| Flow estimator only (separate training) | 13.52 | 13.82 | 60.1 | 65.2 |
| Flow estimator fine-tuning | 2.71 | 2.82 | 61.3 | 62.3 |
| Epoch alternation | 4.54 | 4.91 | 63.5 | 66.9 |
| Batch alternation | 2.78 | 2.95 | 67.1 | 70.5 |
| Combined loss | 2.67 | 2.81 | 67.1 | 70.5 |

The combined loss approach (computing both losses in a single forward pass and summing) yields the best flow results while matching batch alternation on segmentation. Epoch alternation is substantially worse (4.54 vs 2.67 KITTI EPE), and training the flow estimator separately without shared-encoder gradients yields catastrophically poor flow (13.52 EPE).

Backbone Architecture (Table 4)

| Backbone | Params | KITTI EPE ↓ | Sintel Clean EPE ↓ | ISeg mIoU ↑ | VSeg (J&F)m ↑ |
| --- | --- | --- | --- | --- | --- |
| PWC-Net | 8M | 2.66 | 2.80 | 14.8 | 10.1 |
| ResNet-50 | 21M | 2.71 | 2.85 | 55.8 | 60.1 |
| ConvNeXt-T | 23M | 2.67 | 2.81 | 67.1 | 70.5 |

PWC-Net achieves marginally better flow (2.66 vs 2.67 KITTI EPE) but produces useless downstream features (14.8 mIoU). ConvNeXt-T provides an excellent balance: competitive flow estimation and strong semantic features. The 11.3-point mIoU gap over ResNet-50 (67.1 vs 55.8) justifies the modern convolutional architecture choice.

Flow Estimator Architecture (Table 3)

| Factor | Params | LayerNorm | KITTI EPE ↓ | ISeg mIoU ↑ | VSeg ↑ |
| --- | --- | --- | --- | --- | --- |
| C=1 | 2M | No | crashed | — | — |
| C=1 | 2M | Yes | 2.68 | 67.0 | 70.2 |
| C=1 | 2M | Yes + L2-Norm | 6.21 | 53.2 | 47.9 |
| C=2 | 8M | Yes | 2.67 | 67.1 | 70.5 |
Without LayerNorm, multi-task training is completely unstable and crashes. L2-normalization of features is catastrophic, degrading all metrics. The larger estimator (C=2, 8M params) provides marginal improvements over C=1 (2M params).

10. Connection to the JEPA Family

MC-JEPA occupies a unique position in the JEPA family tree. While it shares the name and the core principle of prediction in representation space, its architectural realization diverges significantly from masking-based JEPA variants.

10.1 What MC-JEPA Borrows from the JEPA Framework

  • Latent-space prediction: The feature regression loss $\mathcal{L}_{reg}$ compares encoder representations, not raw pixels. The flow estimator predicts how features should be warped — this is prediction in embedding space, aligned with LeCun's (2022) JEPA vision.
  • Self-supervised objective: No labels are required. Both motion (optical flow) and content (VICReg) objectives are self-supervised.
  • Shared encoder: A single encoder produces representations useful for multiple downstream tasks, consistent with JEPA's goal of learning general-purpose world models.

10.2 How MC-JEPA Differs from I-JEPA and V-JEPA

| Feature | I-JEPA | V-JEPA | MC-JEPA |
| --- | --- | --- | --- |
| Input | Single image | Video clip | Video pairs + images |
| Backbone | ViT | ViT | ConvNeXt-T (convolutional) |
| Masking | Multi-block spatial | Spatiotemporal tubes | None |
| Target encoder | EMA | EMA | None |
| Predictor | Narrow ViT | Narrow ViT | PWC-Net flow estimator |
| Prediction task | Predict masked patch repr. | Predict masked tube repr. | Predict feature warping (flow) |
| Content loss | L2 in repr. space | L2 in repr. space | VICReg (separate objective) |
| Collapse prevention | EMA + narrow predictor | EMA + narrow predictor | VICReg variance-covariance |
| Parameters | ~632M (ViT-H) | ~300M+ (ViT-L) | ~31M (ConvNeXt-T) |

The most fundamental difference is the absence of masking and EMA. I-JEPA and V-JEPA prevent representational collapse through the combination of an EMA target encoder, stop-gradient, and a narrow predictor bottleneck. MC-JEPA instead relies on explicit regularization (VICReg + variance-covariance terms) to prevent collapse. This is a fundamentally different collapse-prevention philosophy: structural asymmetry (I-JEPA) vs. loss-based regularization (MC-JEPA).
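The `vc_regularization` helper applied to pyramidal features in the training-step pseudocode is part of this loss-based collapse prevention but is never spelled out. A plausible sketch, treating each spatial position as a sample and regularizing the channel dimension VICReg-style (without the invariance term); details are an assumption, not the paper's verbatim implementation:

```python
import torch
import torch.nn.functional as F

def vc_regularization(feat, lam_v=0.01, lam_c=0.04, gamma=1.0, eps=1e-4):
    """Variance-covariance regularization of a B×C×H×W feature map.

    Spatial positions are flattened into the batch axis, then the VICReg
    variance hinge and off-diagonal covariance penalty are applied over
    the C channels.
    """
    B, C, H, W = feat.shape
    z = feat.permute(0, 2, 3, 1).reshape(-1, C)      # (B·H·W)×C samples
    std = torch.sqrt(z.var(dim=0) + eps)
    l_var = F.relu(gamma - std).mean()               # keep channel std above gamma
    z = z - z.mean(dim=0)
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    l_cov = off_diag.pow(2).sum() / C                # decorrelate channels
    return lam_v * l_var + lam_c * l_cov
```

Unlike the EMA-plus-predictor asymmetry of I-JEPA, this term penalizes collapse directly in the loss: a constant feature map incurs the full variance hinge, while decorrelated, high-variance channels incur almost none.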

10.3 What MC-JEPA Contributed to the JEPA Lineage

Main novel contribution: MC-JEPA demonstrated that the JEPA framework can accommodate multi-task learning with heterogeneous objectives — motion estimation and content understanding — within a shared encoder, and that these objectives are mutually beneficial. It showed that JEPA-style latent prediction is not limited to masking-based approaches: coarse-to-fine optical flow estimation in feature space constitutes a valid form of joint-embedding prediction. The discovery that motion-content co-training produces the best frozen-feature segmentation results (outperforming DINO) provided strong evidence for the value of motion-aware representation learning.

10.4 Relationship to Later Work

MC-JEPA was published concurrently with the development of V-JEPA (Bardes et al., 2024), which shares co-author Adrien Bardes. While V-JEPA returned to the masking-based JEPA paradigm with ViT backbones applied to video, MC-JEPA's insights about motion-content disentanglement and multi-task learning in shared encoders informed the broader understanding of how video JEPA systems should capture temporal dynamics. The hierarchical coarse-to-fine structure of MC-JEPA's flow estimation has also been noted as a precursor to hierarchical JEPA (H-JEPA) thinking, where predictions are made at multiple abstraction levels.

11. Summary

Key takeaway: MC-JEPA demonstrates that self-supervised motion estimation (optical flow) and content learning (VICReg) are mutually beneficial when trained jointly in a shared convolutional encoder, producing representations that outperform single-task approaches on both optical flow benchmarks and semantic/video segmentation tasks.

Main contribution: A dual-objective architecture that unifies optical flow estimation and self-supervised content learning within a shared ConvNeXt-T backbone, achieving state-of-the-art frozen-feature segmentation (67.1 mIoU on Pascal VOC, 70.5 (J&F)m on DAVIS 2017) while maintaining competitive unsupervised flow estimation (2.81 EPE on Sintel Clean). The critical enabling discovery is that variance-covariance regularization on pyramidal features, combined with LayerNorm in the flow estimator, stabilizes the otherwise fragile multi-task gradient dynamics.

When to use MC-JEPA vs alternatives:

  • Choose MC-JEPA when you need both motion and content understanding from a single compact model (~31M params), especially for video understanding tasks where temporal correspondence matters.
  • Choose I-JEPA/V-JEPA when you need maximum performance on image/video classification with larger models and can afford the ViT-scale parameter count.
  • Choose dedicated flow methods (RAFT, SMURF) when optical flow accuracy is the sole requirement and semantic understanding is not needed.
  • Choose VICRegL when only image segmentation is needed and you do not require motion features.

12. References

  1. Bardes, A., Ponce, J., and LeCun, Y. (2023). MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features. arXiv:2307.12698.
  2. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Open Review, version 0.9.2.
  3. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., and Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
  4. Bardes, A., Ponce, J., and LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
  5. Bardes, A., Ponce, J., and LeCun, Y. (2022). VICRegL: Self-Supervised Learning of Local Visual Features. NeurIPS 2022.
  6. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., and Ballas, N. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. arXiv:2404.08471 (V-JEPA).
  7. Sun, D., Yang, X., Liu, M.-Y., and Kautz, J. (2018). PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. CVPR 2018.
  8. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022). A ConvNet for the 2020s. CVPR 2022.
  9. Stone, A., Maurer, D., Ayvaci, A., Angelova, A., and Jonschkowski, R. (2021). SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping. CVPR 2021.
  10. Luo, K., Wang, C., Liu, S., Fan, H., Wang, J., and Sun, J. (2021). UPFlow: Upsampling Pyramid for Unsupervised Optical Flow Learning. CVPR 2021.
  11. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021.
  12. Chen, X. and He, K. (2021). Exploring Simple Siamese Representation Learning (SimSiam). CVPR 2021.