Motivation
Self-supervised 3D pretraining is dominated by point and voxel reconstruction, which spends model capacity on reproducing exact coordinates and is sensitive to acquisition artifacts — varying sampling density, scanning noise, and partial views. These low-level targets do not align well with the semantic and structural content that downstream tasks (classification, segmentation, detection) actually need. 3D-JEPA targets richer object- and scene-level representations by predicting in latent space, where the target abstracts away the precise geometry the reconstruction objective is forced to model.
How it works
A 3D object or scene is partitioned into local regions — patches of points or voxel blocks — that serve as the masking unit.
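One way to picture the partitioning step is the sketch below, which groups a point cloud into patches and then splits the patches into visible and masked sets. The source does not say how regions are formed or masked, so everything here is an assumption for illustration: farthest point sampling for patch centers, nearest-neighbor grouping, uniform random masking, the function names, and the values `num_patches=16`, `patch_size=32`, and `mask_ratio=0.6`.

```python
import math
import random

def farthest_point_sampling(points, num_centers, seed=0):
    """Return indices of num_centers points spread out over the cloud."""
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    centers = [first]
    # min_dist[i] is the distance from point i to its nearest chosen center.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(centers) < num_centers:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return centers

def make_patches(points, num_patches, patch_size, seed=0):
    """Build patches from the patch_size nearest points to each center.

    Patches built this way may overlap; that is a property of this sketch.
    """
    patches = []
    for c in farthest_point_sampling(points, num_patches, seed):
        order = sorted(range(len(points)),
                       key=lambda i: math.dist(points[i], points[c]))
        patches.append({"center": points[c], "indices": order[:patch_size]})
    return patches

def split_context_target(num_patches, mask_ratio, seed=0):
    """Randomly split patch ids into (visible, masked) lists."""
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_masked = max(1, round(mask_ratio * num_patches))
    return sorted(ids[num_masked:]), sorted(ids[:num_masked])

# Toy usage on a random cloud of 256 points in the unit cube.
cloud_rng = random.Random(0)
cloud = [(cloud_rng.random(), cloud_rng.random(), cloud_rng.random())
         for _ in range(256)]
patches = make_patches(cloud, num_patches=16, patch_size=32)
visible, masked = split_context_target(len(patches), mask_ratio=0.6)
```

Each patch keeps its center so that the position of a masked region can later be handed to the predictor. A block-wise masking scheme, as used in image JEPA variants, could replace the uniform random split without changing the rest of the sketch.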
- A context encoder $f_\theta$ embeds the visible 3D regions.
- A target encoder $\bar f_\theta$, whose weights are an exponential moving average (EMA) of the context encoder's, produces the latents of the masked regions as targets, with gradients stopped.
- A predictor $g_\phi$, conditioned on the spatial position of each target region, predicts those target latents from the visible context.
Positional conditioning supplies the 3D coordinates the predictor needs to localize each masked target; it plays the role that positional mask tokens play in I-JEPA. No explicit geometry reconstruction and no contrastive negatives are involved.
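The data flow between the three components can be sketched with toy stand-ins. This is not the architecture from the source: the encoders here are single linear maps applied to each region independently, the predictor is a linear map over the mean-pooled context concatenated with the target position, and all names and dimensions (`feat_dim`, `latent_dim`, `theta_bar`, and so on) are hypothetical. A real model would presumably use attention-based networks for all three.

```python
import copy
import random

def init_weights(rows, cols, rng):
    """Random dense weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(weights, x):
    """Dense layer without bias."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def encode(weights, region_feats):
    """Stand-in encoder: embeds each region independently."""
    return [linear(weights, f) for f in region_feats]

def mean_pool(vectors):
    """Average a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def predict(pred_weights, z_ctx, position):
    """Stand-in predictor: pooled context concatenated with position p_k."""
    return linear(pred_weights, mean_pool(z_ctx) + list(position))

rng = random.Random(0)
feat_dim, latent_dim = 6, 4
# Five toy regions, each with a feature vector and a 3D center position.
feats = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)] for _ in range(5)]
positions = [(rng.random(), rng.random(), rng.random()) for _ in range(5)]
visible, masked = [0, 1, 2], [3, 4]

theta = init_weights(latent_dim, feat_dim, rng)      # context encoder weights
theta_bar = copy.deepcopy(theta)                     # target encoder starts as a copy
phi = init_weights(latent_dim, latent_dim + 3, rng)  # predictor weights

z_ctx = encode(theta, [feats[i] for i in visible])   # visible regions only
all_targets = encode(theta_bar, feats)               # target encoder sees every region
targets = [all_targets[k] for k in masked]           # treated as constants (no gradient)
preds = [predict(phi, z_ctx, positions[k]) for k in masked]
```

Because the visible context is shared, the position is the only input that differs between the two predictions, which is why the predictor needs it to tell masked targets apart.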
The objective
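For masked 3D regions $k=1,\dots,M$ with spatial positions $p_k$, and with $z_{\text{ctx}}$ denoting the context encoder's output $f_\theta$ on the visible regions, the objective is a latent $\ell_2$ regression: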
$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, p_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$
with $\operatorname{sg}$ denoting the stop-gradient operator and the target encoder updated by EMA. Because the loss regresses the abstract embedding of each masked region rather than its point coordinates, it is insensitive to the exact sampling of points and rewards capturing the structural content at the conditioning position $p_k$.
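A small worked example of the objective and the EMA step, in plain Python, may help. The loss function follows the formula above directly. The EMA rule is written in its usual form, new target weight = $\tau$ times old target weight plus $(1-\tau)$ times context-encoder weight, where the symbol $\tau$ and the name `momentum` are assumptions not taken from the source. There is no autodiff here, so the stop-gradient amounts to treating the targets as plain constants.

```python
def latent_l2_loss(predictions, targets):
    """Mean over masked regions of the squared L2 distance to the target latent."""
    total = 0.0
    for pred, tgt in zip(predictions, targets):
        total += sum((a - b) ** 2 for a, b in zip(pred, tgt))
    return total / len(predictions)

def ema_update(target_params, online_params, momentum):
    """Move each target-encoder weight toward the context encoder, in place."""
    for t_row, o_row in zip(target_params, online_params):
        for j, o in enumerate(o_row):
            t_row[j] = momentum * t_row[j] + (1.0 - momentum) * o

# Two masked regions with 2-dimensional latents.
example_preds = [[1.0, 2.0], [0.0, 0.0]]
example_targets = [[1.0, 0.0], [3.0, 4.0]]
loss = latent_l2_loss(example_preds, example_targets)  # (4 + 25) / 2 = 14.5

# momentum=0.5 is chosen only to keep the arithmetic exact.
target_w = [[0.0, 1.0]]
online_w = [[1.0, 1.0]]
ema_update(target_w, online_w, momentum=0.5)           # target_w becomes [[0.5, 1.0]]
```

In practice the momentum is usually set close to 1 so that the target encoder changes slowly relative to the context encoder.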
Key results & what's novel
3D-JEPA extends the JEPA recipe to volumetric and object-level 3D understanding. Because it predicts abstract latents of masked regions rather than their precise point coordinates, the encoder is encouraged to capture semantic and structural content — shape parts, scene layout — that is invariant to point density and acquisition noise. This makes it a decoder-free, sampling-robust alternative to 3D masked autoencoders, with representations that transfer to downstream classification, segmentation, and detection. The position-conditioned predictor is the key mechanism that localizes each masked target in 3D space.
Strengths & limitations
- + Robust to sampling density and noise by predicting abstract latents, not coordinates.
- + Decoder-free alternative to 3D masked autoencoders; scales to object and scene level.
- + Position conditioning cleanly localizes masked 3D targets.
- − Region partitioning and masking design add hyperparameters specific to 3D.
- − Abstract targets discard fine geometric detail that some tasks may need.
- − A static representation learner; no scene dynamics or action conditioning.