At a glance
Problem: 3D SSL is dominated by point/voxel reconstruction, which wastes capacity on exact coordinates and is sensitive to sampling density and noise.
Key idea: Predict abstract latents of masked 3D regions, conditioned on their spatial position, rather than reconstructing precise geometry.
Modality: 3D objects and scenes (point/voxel regions)
Target / masking: Local 3D regions (point patches or voxel blocks); position-conditioned EMA latent targets.
Builds on: I-JEPA latent masked prediction; 3D masked autoencoders.
Used for: 3D classification, segmentation, detection; object- and scene-level understanding.

Motivation

Self-supervised 3D pretraining is dominated by point and voxel reconstruction, which spends model capacity on reproducing exact coordinates and is sensitive to acquisition artifacts — varying sampling density, scanning noise, and partial views. These low-level targets do not align well with the semantic and structural content that downstream tasks (classification, segmentation, detection) actually need. 3D-JEPA targets richer object- and scene-level representations by predicting in latent space, where the target abstracts away the precise geometry the reconstruction objective is forced to model.

How it works

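[Figure: JEPA schematic. A point cloud is split into point patches (block masking); the context encoder f_θ produces z_ctx, the target encoder f̄_θ (EMA copy, stop-gradient) produces z̄, and the predictor g_φ outputs ẑ, trained with the latent loss ‖ẑ − sg(z̄)‖².]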
Canonical JEPA schematic for point clouds. The input is split into a visible context and hidden targets (masked at the point-patch level, in blocks). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embeddings to predictions of the target embeddings; training minimises the latent distance between the two.

A 3D object or scene is partitioned into local regions — patches of points or voxel blocks — that serve as the masking unit.
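As an illustration of the partitioning step, here is a minimal NumPy sketch that groups a point cloud into point patches and splits them into context and target sets. The grouping scheme (farthest-point sampling for patch centers, then k-nearest neighbours), the function names, the patch counts, and the 60% mask ratio are all assumptions chosen for the example; the source does not specify how regions are formed or masked.

```python
import numpy as np

def farthest_point_sample(points, num_centers, rng):
    """Greedy farthest-point sampling; returns indices of the chosen centers."""
    n = points.shape[0]
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = rng.integers(n)
    # Squared distance from every point to its nearest chosen center so far.
    min_dist = np.full(n, np.inf)
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(min_dist))
    return chosen

def make_patches(points, num_patches=16, patch_size=32, seed=0):
    """Group points into patches of the patch_size nearest neighbours of each center."""
    rng = np.random.default_rng(seed)
    centers = points[farthest_point_sample(points, num_patches, rng)]   # (P, 3)
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)  # (P, N)
    knn = np.argsort(d2, axis=1)[:, :patch_size]                         # (P, K)
    # Express each patch relative to its center; the center carries the position.
    patches = points[knn] - centers[:, None, :]                          # (P, K, 3)
    return patches, centers

def split_context_targets(num_patches, mask_ratio=0.6, seed=0):
    """Randomly pick masked (target) patches; the rest form the visible context."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_patches)
    num_masked = int(round(mask_ratio * num_patches))
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

cloud = np.random.default_rng(0).normal(size=(1024, 3))  # synthetic stand-in for a scan
patches, centers = make_patches(cloud)
ctx_idx, tgt_idx = split_context_targets(len(centers))
print(patches.shape, centers.shape, len(ctx_idx), len(tgt_idx))
```

Storing each patch relative to its center separates local shape from location, so the center can later serve as the spatial position of the region. A voxel-block partition would fill the same role with a regular grid in place of the sampled centers.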

  • A context encoder $f_\theta$ embeds the visible 3D regions.
  • An EMA target encoder $\bar f_\theta$ produces the latents of masked regions as targets, with gradients stopped.
  • A predictor $g_\phi$, conditioned on the spatial position of each target region, predicts those target latents from the visible context.

Positional conditioning supplies the 3D coordinates the predictor needs to localize each masked target, the analogue of I-JEPA's positional mask tokens. No explicit geometry reconstruction and no contrastive negatives are involved.
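The three components above can be sketched as a toy forward pass. This is a hedged illustration, not the paper's architecture: the encoder here is a single per-point linear layer with max-pooling, the predictor is a one-hidden-layer map over the mean context latent, and the names `encode`, `predict`, `init_linear`, and the width `D` are hypothetical. What it does preserve is the data flow: the context encoder sees only visible patches, the target encoder embeds the full input and is indexed at the masked regions, and the predictor receives the position of each target.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # latent width (assumed for the example)

def init_linear(n_in, n_out):
    """Random weights and zero bias for a linear layer."""
    return rng.normal(scale=n_in ** -0.5, size=(n_in, n_out)), np.zeros(n_out)

def encode(patches, centers, params):
    """Toy patch encoder: per-point linear + ReLU, max-pool, plus a position embedding."""
    (w_pt, b_pt), (w_pos, b_pos) = params
    feats = np.maximum(patches @ w_pt + b_pt, 0.0).max(axis=1)  # (P, D)
    return feats + centers @ w_pos + b_pos

def predict(z_ctx, target_centers, params):
    """Toy predictor: pooled context summary combined with each target's position."""
    (w_pos, b_pos), (w_out, b_out) = params
    summary = z_ctx.mean(axis=0)                         # (D,)
    query = target_centers @ w_pos + b_pos               # (M, D), position conditioning
    hidden = np.maximum(summary[None, :] + query, 0.0)   # (M, D)
    return hidden @ w_out + b_out                        # (M, D) predicted latents

# Synthetic patches and centers standing in for a partitioned point cloud.
patches = rng.normal(size=(16, 32, 3))
centers = rng.normal(size=(16, 3))
ctx_idx, tgt_idx = np.arange(6), np.arange(6, 16)

enc_params = (init_linear(3, D), init_linear(3, D))
# The target encoder starts as a copy of the context encoder and is never trained directly.
target_params = tuple((w.copy(), b.copy()) for w, b in enc_params)
pred_params = (init_linear(3, D), init_linear(D, D))

z_ctx = encode(patches[ctx_idx], centers[ctx_idx], enc_params)   # visible regions only
z_tgt = encode(patches, centers, target_params)[tgt_idx]         # targets, treated as constants
z_hat = predict(z_ctx, centers[tgt_idx], pred_params)            # one prediction per position
print(z_ctx.shape, z_tgt.shape, z_hat.shape)
```

Because the context summary is shared across targets, the position query is the only thing that distinguishes one prediction from another in this sketch, which is the role the text assigns to positional conditioning. A practical predictor would more likely attend over the individual context latents than mean-pool them.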

The objective

For masked 3D regions $k=1\dots M$ with spatial positions $p_k$, the objective is a latent $\ell_2$ regression:

$$\mathcal{L} = \frac{1}{M}\sum_{k} \big\lVert\, g_\phi(z_{\text{ctx}}, p_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$

with $\operatorname{sg}$ the stop-gradient and the target encoder updated by EMA. By regressing the abstract embedding of each masked region rather than its point coordinates, the loss is insensitive to exact sampling and rewards capturing structural content localized by the conditioning position $p_k$.
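The objective and the EMA update can be written in a few lines. The sketch below assumes the common form of the EMA rule, $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, with a momentum $\tau$ of 0.996 picked only as a typical default; the source does not give the update rule or its momentum. The names `latent_l2_loss` and `ema_update` are hypothetical.

```python
import numpy as np

def latent_l2_loss(z_hat, z_target):
    """Mean over the M masked regions of the squared L2 distance to the target latent.

    z_target plays the role of sg[f_bar(x)_k]: it is treated as a constant.
    NumPy has no autograd, so there is nothing to detach in this sketch.
    """
    return float(np.mean(np.sum((z_hat - z_target) ** 2, axis=-1)))

def ema_update(target_params, online_params, momentum=0.996):
    """Move each target-encoder parameter a small step toward the online encoder."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

# Two masked regions with 2-D latents: squared distances are 1 and 4.
z_hat = np.array([[1.0, 0.0], [0.0, 2.0]])
z_target = np.zeros((2, 2))
loss = latent_l2_loss(z_hat, z_target)
print(loss)  # 2.5

# One EMA step with an exaggerated momentum of 0.9 to make the change visible.
new_target = ema_update([np.zeros(2)], [np.ones(2)], momentum=0.9)
print(new_target[0])
```

In an autograd framework the target latents would be computed without gradient tracking, so only the context encoder and predictor receive gradients while the target encoder changes solely through the EMA step.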

Key results & what's novel

3D-JEPA extends the JEPA recipe to volumetric and object-level 3D understanding. Because it predicts abstract latents of masked regions rather than their precise point coordinates, the encoder is encouraged to capture semantic and structural content — shape parts, scene layout — that is invariant to point density and acquisition noise. This makes it a decoder-free, sampling-robust alternative to 3D masked autoencoders, with representations that transfer to downstream classification, segmentation, and detection. The position-conditioned predictor is the key mechanism that localizes each masked target in 3D space.

Strengths & limitations

  • + Robust to sampling density and noise by predicting abstract latents, not coordinates.
  • + Decoder-free alternative to 3D masked autoencoders; scales to object and scene level.
  • + Position conditioning cleanly localizes masked 3D targets.
  • − Region partitioning and masking design add hyperparameters specific to 3D.
  • − Abstract targets discard fine geometric detail that some tasks may need.
  • − A static representation learner; no scene dynamics or action conditioning.

Connections & references

Builds on: I-JEPA