Point-JEPA — World Modeling

At a glance

Problem: Point clouds are unordered, irregular sets, so the contiguous block masking that makes JEPA efficient on images has no natural analogue.

Key idea: A learned token sequencer orders point patches by spatial proximity so coherent contiguous blocks can be masked and predicted in latent space.

Modality: 3D point clouds

Target / masking: Contiguous groups of point patches in the induced proximity ordering; targets are EMA latents.

Builds on: I-JEPA block masking; PointNet patch tokenization.

Used for: 3D shape pretraining, classification, segmentation transfer.

Motivation

I-JEPA's efficiency rests on selecting contiguous blocks of patches as context and targets — a notion that depends on the regular grid structure of images. Point clouds are unordered, irregular sets with no canonical ordering, so naive masking produces spatially incoherent target groups scattered across the object, making the prediction task ill-posed. Point-JEPA brings latent masked prediction to 3D point data while preserving the locality that makes block masking meaningful, without resorting to a heavy point-reconstruction decoder.

How it works

Canonical JEPA schematic for point clouds. The input is split into a visible context and hidden targets (contiguous blocks of point patches). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embeddings to predicted target embeddings; training minimises the distance between predicted and target embeddings in latent space.

Points are grouped into local patches — centroids with their k-nearest-neighbor neighborhoods — and tokenized by a small PointNet.

A learned token sequencer orders these patch tokens by spatial proximity, so that contiguous spans in the ordering correspond to coherent spatial regions.
A context encoder $f_\theta$ embeds the visible patches.
An EMA target encoder $\bar f_\theta$ produces the latents of masked target patch-blocks.
A predictor $g_\phi$ regresses those target latents from the context, with no point reconstruction.
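
The grouping and ordering steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: farthest-point sampling is assumed for picking centroids (the text only says patches are centroids with k-NN neighborhoods), the PointNet tokenizer is omitted, and a greedy nearest-neighbour chain stands in for the sequencer, which the text describes as learned. All function names are hypothetical.

```python
import math
import random


def farthest_point_sampling(points, num_centers):
    """Greedily pick indices of well-spread points (assumed centroid rule)."""
    chosen = [0]  # assumption: start from the first point
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < num_centers:
        # The next centre is the point farthest from all centres chosen so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return chosen


def knn_patches(points, center_ids, k):
    """For each centre, return the indices of its k nearest points."""
    patches = []
    for c in center_ids:
        by_dist = sorted(range(len(points)),
                         key=lambda i: math.dist(points[i], points[c]))
        patches.append(by_dist[:k])  # the centre itself comes first
    return patches


def proximity_order(centers):
    """Greedy stand-in for the sequencer: chain each centre to its nearest
    not-yet-visited neighbour, so adjacent indices are spatially close."""
    start = min(range(len(centers)), key=lambda i: sum(centers[i]))
    order, remaining = [start], set(range(len(centers))) - {start}
    while remaining:
        last = centers[order[-1]]
        nxt = min(remaining, key=lambda i: math.dist(centers[i], last))
        order.append(nxt)
        remaining.remove(nxt)
    return order


random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(256)]
center_ids = farthest_point_sampling(cloud, num_centers=16)
patches = knn_patches(cloud, center_ids, k=8)
order = proximity_order([cloud[i] for i in center_ids])
```

Here `order` is a permutation of the 16 patch indices; in a full pipeline each patch would first be embedded by the tokenizer, and the ordering would be applied to the resulting tokens.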

The masking unit is a contiguous group of point patches in the induced ordering — recovering the coherent-block structure that block-masking JEPAs rely on.
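
To make the masking unit concrete, the sketch below samples target blocks and a context block as contiguous spans of an already-computed ordering. The block counts and lengths are hypothetical parameters, and removing target positions from the context follows the I-JEPA convention; the text here does not state how Point-JEPA handles overlap, so treat that step as an assumption.

```python
import random


def sample_block(num_tokens, block_len, rng):
    """Return the positions of one contiguous span of length block_len."""
    start = rng.randrange(num_tokens - block_len + 1)
    return list(range(start, start + block_len))


def sample_masks(order, num_targets, target_len, context_len, rng):
    """Pick target blocks and a context block as contiguous spans of `order`.

    Positions covered by any target block are dropped from the context, so
    the context encoder never sees the patches it is asked to predict.
    """
    n = len(order)
    target_pos = [sample_block(n, target_len, rng) for _ in range(num_targets)]
    hidden = {p for block in target_pos for p in block}
    context_pos = [p for p in sample_block(n, context_len, rng)
                   if p not in hidden]
    # Map sequence positions back to patch ids.
    targets = [[order[p] for p in block] for block in target_pos]
    context = [order[p] for p in context_pos]
    return context, targets


# Any permutation of patch ids works as the ordering for this illustration.
example_order = [3, 0, 2, 1, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14]
context, targets = sample_masks(example_order, num_targets=2, target_len=3,
                                context_len=10, rng=random.Random(0))
```

Because the spans are contiguous in the ordering rather than in raw point indices, each target block covers neighbouring patches on the object whenever the ordering itself respects proximity.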

The objective

For masked target patch-blocks $k=1\dots M$, the loss is the latent $\ell_2$ distance:

$$\mathcal{L} = \frac{1}{M}\sum_{k} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}\big[\bar f_\theta(x)_k\big]\,\big\rVert_2^2,$$

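with $\operatorname{sg}$ the stop-gradient and the target encoder updated by EMA. Here $z_{\text{ctx}}$ denotes the context encoder's output on the visible patches, and $m_k$ tells the predictor which target block to predict (for example, mask tokens carrying that block's positions). The proximity-based sequencer is what makes the masked blocks $k$ spatially coherent; the objective itself is the standard JEPA representation-space regression, with no reconstruction of point coordinates and no contrastive negatives.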
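
As a small numeric illustration of the objective, the sketch below computes the mean squared $\ell_2$ distance over target blocks and applies one EMA step to the target weights. It assumes each block's embedding is flattened into a single vector, and that the EMA rule is the usual $\bar\theta \leftarrow \tau\bar\theta + (1-\tau)\theta$ with a hypothetical momentum $\tau$; the text does not give the schedule. Plain Python has no autograd, so the stop-gradient is implicit: targets are treated as constants.

```python
def latent_l2_loss(predictions, targets):
    """Mean over target blocks of the squared L2 distance between the
    predicted and the (constant) target embedding of each block."""
    per_block = [
        sum((p - t) ** 2 for p, t in zip(pred, tgt))
        for pred, tgt in zip(predictions, targets)
    ]
    return sum(per_block) / len(per_block)


def ema_update(target_params, online_params, momentum):
    """Move the target-encoder weights a small step towards the
    context-encoder weights; momentum close to 1 means slow updates."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


# Two target blocks (M = 2) with 2-dimensional embeddings.
loss = latent_l2_loss([[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [3.0, 4.0]])
new_target = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
```

In a real training loop only the context encoder and predictor receive gradients from the loss, and `ema_update` would be applied to every target-encoder parameter after each optimizer step.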

Key results & what's novel

Point-JEPA delivers competitive 3D representations with greater pretraining efficiency than point masked-autoencoders, since it predicts abstract latents rather than reconstructing coordinates. Its central contribution is the learned token sequencer: imposing a proximity-based ordering on otherwise unordered point patches recovers the notion of a coherent spatial block, letting the efficient image-JEPA masking and latent-prediction scheme operate directly on 3D geometry. The sequencer is a reusable trick for applying block-masking JEPAs to set-structured data more broadly, beyond point clouds.

Strengths & limitations

+ Recovers coherent block masking on unordered sets via a learned proximity ordering.
+ More pretraining-efficient than point masked-autoencoders; decoder-free.
+ The sequencer generalizes to other set-structured data.
− Patch grouping (k-NN, centroid count) and sequencer add design complexity and hyperparameters.
− Locality from proximity ordering may not capture long-range or topological structure well.
− A representation learner; no geometry generation and no dynamics/action.

Connections & references

Builds on: I-JEPA

Related: 3D-JEPA, CrossJEPA, I-JEPA, V-JEPA
