At a glance
Problem: Pretraining strong 3D encoders is expensive and data-hungry, while large 2D image encoders are abundant and information-rich.
Key idea: Learn 3D representations cross-modally by predicting 3D structure latents from 2D image views in a shared embedding space.
Modality: Cross-modal, 2D images to 3D point clouds.
Target / masking: Cross-modal correspondence: 2D-view context predicts the latent of the corresponding 3D region.
Builds on: JEPA latent prediction; large 2D image encoders.
Used for: Efficient 3D representation learning via 2D-to-3D transfer.

Motivation

Training strong 3D encoders from scratch is expensive and data-hungry: 3D datasets are smaller and costlier to acquire than image collections. Meanwhile, large 2D image encoders are abundant, cheap to obtain, and already information-rich. CrossJEPA asks whether 3D representations can be learned efficiently by transferring knowledge from the 2D modality, rather than by costly native 3D pretraining — turning plentiful 2D supervision into a 3D learning signal through latent prediction.

How it works

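[Figure: JEPA schematic. Input: 2D images + point cloud (point patches, cross-modal). Context encoder $f_\theta$ produces $z_{ctx}$; predictor $g_\phi$ produces $\hat z$; target encoder $\bar f_\theta$ (EMA copy) produces $\bar z$ (stop-gradient). Latent loss: $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$. Auxiliary local loss (e.g. MLM).]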
Canonical JEPA schematic for 2D images + point cloud. The input is split into a visible context and hidden targets (point patch-level, cross-modal). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps context to the target embeddings; training minimises the latent distance. A local/generative loss runs alongside latent prediction (hybrid objective).
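
The "EMA copy" in the schematic can be illustrated with a minimal sketch of an exponential-moving-average parameter update. This shows only the generic rule, $\bar\theta \leftarrow m\,\bar\theta + (1 - m)\,\theta$. The function name `ema_update`, the dictionary-of-arrays parameter layout, and the momentum values are assumptions for illustration, and the source does not specify which online network the EMA tracks in the cross-modal case.

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.99):
    """Return new target parameters: m * target + (1 - m) * online.

    Both arguments are dicts mapping parameter names to arrays
    (a hypothetical layout chosen for this sketch).
    """
    return {
        name: momentum * target_params[name]
        + (1.0 - momentum) * online_params[name]
        for name in target_params
    }

# Toy example: the target drifts slowly toward the online weights.
online = {"w": np.array([1.0, 2.0])}
target = {"w": np.array([0.0, 0.0])}
target = ema_update(target, online, momentum=0.9)
```

No gradient flows through this update; the target parameters change only by averaging, which is what "gradient stopped" refers to on the target side.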

CrossJEPA applies the joint-embedding predictive objective across modalities.

  • A context encoder $f_\theta$ processes 2D images — renders or views of a 3D object or scene.
  • A predictor $g_\phi$ maps the resulting 2D-derived latents to predict the embeddings of the corresponding 3D structure.
  • A target encoder $\bar f_\theta$ operates on the point cloud / 3D representation to produce those targets, with gradients stopped.
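
The three components above can be wired together as in the following toy sketch. It is not the paper's architecture: the linear-plus-`tanh` encoders, the max pooling over points, and all sizes (`D_IMG`, `D_PTS`, `D_LATENT`) are assumptions chosen only to show the data flow and that prediction and target share one embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_PTS, D_LATENT = 32, 3, 16  # hypothetical sizes

# Toy linear weights standing in for the real networks.
W_ctx = rng.normal(size=(D_IMG, D_LATENT)) * 0.1
W_pred = rng.normal(size=(D_LATENT, D_LATENT)) * 0.1
W_tgt = rng.normal(size=(D_PTS, D_LATENT)) * 0.1

def context_encoder(view_features):
    """f_theta: (batch, D_IMG) image-view features -> (batch, D_LATENT)."""
    return np.tanh(view_features @ W_ctx)

def predictor(z_ctx):
    """g_phi: 2D-derived context latents -> predicted 3D latents."""
    return z_ctx @ W_pred

def target_encoder(points):
    """bar f_theta: (batch, n_points, 3) point clouds -> (batch, D_LATENT).

    Per-point embedding followed by max pooling over points (a toy choice).
    """
    return np.tanh(points @ W_tgt).max(axis=1)

views = rng.normal(size=(4, D_IMG))      # stand-in for 2D view features
clouds = rng.normal(size=(4, 128, 3))    # stand-in for matching point clouds

z_hat = predictor(context_encoder(views))  # prediction from the 2D side
z_bar = target_encoder(clouds)             # target; treated as a constant
```

Here `z_hat` and `z_bar` both have shape `(4, 16)`, so they can be compared directly in latent space.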

The cross-modal correspondence — which 2D view aligns with which 3D region — defines the prediction target. The 3D-side representation is supervised entirely by prediction from the 2D view, with no pixel or point reconstruction, so the rich 2D signal is distilled into the 3D embedding space.
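
The source does not say how the 2D-3D correspondence is computed. One common way to obtain it for calibrated views is a pinhole projection, sketched below under that assumption. The function name `visible_point_mask`, the intrinsics matrix `K`, and the image size are hypothetical, and the sketch ignores occlusion (it only checks that a point is in front of the camera and lands inside the image).

```python
import numpy as np

def visible_point_mask(points_cam, K, width, height):
    """Project (n, 3) camera-frame points with (3, 3) intrinsics K.

    Returns pixel coordinates (n, 2) and a boolean mask marking points
    that are in front of the camera and fall inside the image bounds.
    """
    z = points_cam[:, 2]
    in_front = z > 0
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by non-positive depth
    uvw = points_cam @ K.T
    uv = uvw[:, :2] / safe_z[:, None]
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv, in_front & inside

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0],    # projects to the image centre
                   [5.0, 0.0, 1.0],    # projects outside the image
                   [0.0, 0.0, -1.0]])  # behind the camera
uv, mask = visible_point_mask(points, K, width=100, height=100)
```

Points selected by `mask` form the 3D region associated with that view; a real pipeline would also need a depth test to drop occluded points.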

The objective

Given a 2D view $v$ and its corresponding 3D structure $x_{3D}$, the loss is a latent-space regression:

$$\mathcal{L} = \big\lVert\, g_\phi\big(f_\theta(v)\big) - \operatorname{sg}\big[\bar f_\theta(x_{3D})\big]\,\big\rVert_2^2,$$

where $\operatorname{sg}$ is the stop-gradient and $\bar f_\theta$ is the (EMA) 3D target encoder. Because the objective predicts 3D latents from the 2D-derived context and never reconstructs pixels or points, all supervision flows from the cheaply available 2D modality into the 3D representation through the cross-modal correspondence.
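
As a small numerical illustration of this objective, the sketch below computes the squared L2 distance between predicted and target latents and its gradient with respect to the prediction only. Averaging over a batch is an assumption (the equation above is written for a single pair), and the function names are hypothetical. The stop-gradient appears here simply as the target being treated as a constant.

```python
import numpy as np

def latent_loss(z_hat, z_bar):
    """Squared L2 distance per sample, averaged over the batch."""
    diff = z_hat - z_bar
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

def latent_loss_grad(z_hat, z_bar):
    """Gradient of latent_loss w.r.t. z_hat only.

    z_bar is a constant (stop-gradient), so no gradient is returned for it.
    """
    return 2.0 * (z_hat - z_bar) / z_hat.shape[0]

z_hat = np.array([[1.0, 2.0], [0.0, 0.0]])  # predictions g_phi(f_theta(v))
z_bar = np.array([[0.0, 0.0], [0.0, 0.0]])  # targets sg[bar f_theta(x_3D)]

loss = latent_loss(z_hat, z_bar)       # (1 + 4 + 0) / 2 = 2.5
grad = latent_loss_grad(z_hat, z_bar)  # equals z_hat - z_bar for batch size 2
```

Only the context encoder and predictor would receive this gradient; the target side is updated separately, if at all.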

Key results & what's novel

CrossJEPA reframes 2D-to-3D transfer as latent prediction: by predicting 3D latents from 2D images in a shared embedding space, it distills the rich, cheaply available 2D signal into 3D representations and sidesteps costly native 3D pretraining. This yields efficient 3D representation learning and extends the JEPA family to a genuinely cross-modal setting where the context and target live in different sensory domains — distinct from within-modality 3D JEPAs like Point-JEPA and 3D-JEPA. The contribution is showing that the latent-prediction objective is an effective vehicle for cross-modal knowledge transfer into a data-scarce target modality.

Strengths & limitations

  • + Leverages abundant, information-rich 2D encoders to learn 3D efficiently.
  • + Avoids costly native 3D pretraining and any point/pixel reconstruction.
  • + Demonstrates JEPA latent prediction as a cross-modal transfer mechanism.
  • − Requires accurate 2D-3D correspondences (renders or calibrated views) for training.
  • − 3D representation quality is bounded by what the 2D views actually capture; occluded or view-invisible structure is hard to learn.
  • − A representation learner, not generative; no 3D synthesis or dynamics.

Connections & references

Builds on: I-JEPA