Motivation
I-JEPA's masking strategy selects contiguous blocks of patches as context and targets, a notion that depends on the regular grid structure of images. Point clouds are unordered, irregular sets, so naive masking produces target groups scattered incoherently across the object, and the prediction task becomes poorly defined. Point-JEPA brings latent masked prediction to 3D point data while preserving the locality that makes block masking meaningful, without resorting to a heavy point-reconstruction decoder.
How it works
Points are grouped into local patches (centroids with their k nearest neighbors), and each patch is tokenized by a small PointNet.
- A token sequencer orders these patch tokens by the spatial proximity of their centers, so that contiguous spans in the ordering correspond to coherent spatial regions.
- A context encoder $f_\theta$ embeds the visible patches.
- An EMA target encoder $\bar f_\theta$ produces the latents of masked target patch-blocks.
- A predictor $g_\phi$ regresses those target latents from the context, with no point reconstruction.
The masking unit is a contiguous group of point patches in the induced ordering — recovering the coherent-block structure that block-masking JEPAs rely on.
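The grouping, sequencing, and block-selection steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes farthest point sampling for the centroids, Euclidean k-NN for the neighborhoods, and a greedy nearest-neighbor walk standing in for the sequencer. The function names (`farthest_point_sampling`, `knn_group`, `greedy_sequence`, `contiguous_block`) and all sizes are hypothetical, and the PointNet tokenizer and the encoders are omitted.

```python
import math
import random


def farthest_point_sampling(points, num_centers):
    """Pick num_centers point indices that are spread out over the cloud."""
    chosen = [0]  # assumption: start from the first point (arbitrary)
    # Distance from every point to its nearest already-chosen center.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < num_centers:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def knn_group(points, center_idx, k):
    """For each center, return the indices of its k nearest points (one patch)."""
    patches = []
    for c in center_idx:
        by_distance = sorted(range(len(points)),
                             key=lambda i: math.dist(points[i], points[c]))
        patches.append(by_distance[:k])
    return patches


def greedy_sequence(centers):
    """Order patch centers so that neighbors in the sequence are close in space."""
    # Assumption: start at the center with the smallest coordinate sum.
    current = min(range(len(centers)), key=lambda i: sum(centers[i]))
    order = [current]
    remaining = set(range(len(centers))) - {current}
    while remaining:
        # Hop to the closest center that has not been visited yet.
        current = min(remaining,
                      key=lambda i: math.dist(centers[i], centers[current]))
        order.append(current)
        remaining.remove(current)
    return order


def contiguous_block(order, start, length):
    """A masking unit: a contiguous span of patch indices in the sequence."""
    return order[start:start + length]


random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(256)]
center_idx = farthest_point_sampling(cloud, num_centers=16)
patches = knn_group(cloud, center_idx, k=8)
centers = [cloud[i] for i in center_idx]

order = greedy_sequence(centers)
target = contiguous_block(order, start=4, length=4)  # masked target block
context = [i for i in order if i not in target]      # visible patches
```

Because the target is a slice of `order`, its patches are consecutive in the proximity ordering and therefore tend to sit next to each other in space, which is the property the block-masking scheme needs.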
The objective
For masked target patch-blocks $k = 1, \dots, M$, the loss is the squared $\ell_2$ distance in latent space, averaged over blocks:
$$\mathcal{L} = \frac{1}{M}\sum_{k=1}^{M} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$
where $z_{\text{ctx}}$ is the context encoder's output on the visible patches, $m_k$ identifies target block $k$ for the predictor, and $\operatorname{sg}$ is the stop-gradient; the target encoder is updated by EMA. The proximity-based sequencer is what makes the masked blocks $k$ spatially coherent; the objective itself is the standard JEPA representation-space regression, with no reconstruction of point coordinates and no contrastive negatives.
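As a worked illustration of the objective, the following plain-Python sketch evaluates the block-averaged squared $\ell_2$ loss on toy latents and applies one EMA step, $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$. The helper names (`block_loss`, `ema_update`), the nested-list layout, and the momentum default are assumptions made for this example. With no autodiff involved, the stop-gradient simply corresponds to treating the targets as constants; a real implementation would use a tensor library and detach the target latents.

```python
def block_loss(pred_blocks, target_blocks):
    """Mean over blocks of the squared L2 distance between predicted and target latents.

    Each block is a list of latent vectors, one vector per patch in the block.
    Targets are plain numbers here, i.e. constants (the stop-gradient).
    """
    total = 0.0
    for pred, tgt in zip(pred_blocks, target_blocks):
        total += sum((p - t) ** 2
                     for pred_vec, tgt_vec in zip(pred, tgt)
                     for p, t in zip(pred_vec, tgt_vec))
    return total / len(pred_blocks)


def ema_update(target_params, context_params, tau=0.996):
    """One EMA step moving target-encoder weights toward context-encoder weights.

    tau=0.996 is an illustrative momentum value, not taken from the text.
    """
    return [tau * t + (1.0 - tau) * c
            for t, c in zip(target_params, context_params)]


# Toy case: M = 2 target blocks, one patch per block, latent dimension 2.
pred = [[[1.0, 2.0]], [[0.0, 0.0]]]
tgt = [[[1.0, 0.0]], [[3.0, 4.0]]]
loss = block_loss(pred, tgt)  # (4 + 25) / 2 = 14.5

new_target_params = ema_update([1.0, 1.0], [0.0, 2.0], tau=0.5)  # [0.5, 1.5]
```

With a momentum close to 1, the target encoder changes slowly, so the regression targets stay stable while the context encoder and predictor are trained by gradient descent.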
Key results & what's novel
Point-JEPA delivers competitive 3D representations with greater pretraining efficiency than point masked autoencoders, since it predicts abstract latents rather than reconstructing coordinates. Its central contribution is the token sequencer: imposing a proximity-based ordering on otherwise unordered point patches recovers the notion of a coherent spatial block, letting the image-JEPA masking and latent-prediction scheme operate directly on 3D geometry. The sequencer is potentially a reusable trick for applying block-masking JEPAs to set-structured data more broadly, beyond point clouds.
Strengths & limitations
- + Recovers coherent block masking on unordered sets via a proximity-based ordering.
- + More pretraining-efficient than point masked-autoencoders; decoder-free.
- + The sequencer generalizes to other set-structured data.
- − Patch grouping (k-NN size, centroid count) and the sequencer add design complexity and hyperparameters.
- − Locality from proximity ordering may not capture long-range or topological structure well.
- − A representation learner only: it does not generate geometry or model dynamics or actions.