Graph-JEPA — World Modeling

At a glance

Problem: Graph-level self-supervised learning needs a whole-graph representation, but contrastive graph methods rely on unclear augmentations and topology reconstruction is brittle.

Key idea: Bring the augmentation-free latent-prediction recipe to graphs: predict the embedding of a masked subgraph from the representation of the visible context.

Modality: Graph

Target / masking: Mask subgraphs; a target encoder embeds the masked region to supply latent targets.

Builds on: I-JEPA-style block prediction adapted to non-Euclidean graph topology.

Used for: Transferable whole-graph representations for downstream graph classification.

Motivation

Many tasks need a representation of an entire graph — a molecule, for instance — rather than of individual nodes. The two standard self-supervised routes both have weaknesses on graphs. Contrastive graph methods depend on hand-crafted augmentations (edge dropping, node masking) whose semantics are unclear for structured data: it is not obvious which perturbations should leave a graph's meaning invariant. Generative reconstruction of graph topology is brittle. Graph-JEPA (Skenderi et al., 2023) brings the augmentation-free, latent-prediction recipe of I-JEPA to graph-level learning.

How it works

Figure: canonical JEPA schematic applied to graphs. The input graph is split into a visible context and hidden targets (masked subgraphs). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy of the context encoder, with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between predicted and target embeddings in latent space.

A portion of the graph is masked, and the model predicts the embedding of the masked subgraph from the representation of the visible context.

A context encoder embeds the visible part of the graph.
A predictor maps the context representation to the predicted latent embedding of the masked subgraph.
A target encoder embeds the masked subgraph to supply stop-gradient latent targets.
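The three components above can be sketched as a single forward pass. This is a toy illustration under loud assumptions, not the Graph-JEPA implementation: the "encoder" is one round of neighbour averaging followed by a linear map and mean pooling, the predictor is a single linear layer, and the names `encode`, `linear`, and `mean_pool`, the 6-node ring, and the context/masked split are all hypothetical.

```python
import random

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def linear(weights, vec):
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def encode(weights, nodes, adjacency, features):
    """Toy graph encoder (assumption): average each node with its neighbours
    inside `nodes`, apply a linear map, then mean-pool over the node set."""
    node_set = set(nodes)
    embedded = []
    for n in nodes:
        neighbourhood = [m for m in adjacency[n] if m in node_set] + [n]
        aggregated = mean_pool([features[m] for m in neighbourhood])
        embedded.append(linear(weights, aggregated))
    return mean_pool(embedded)

rng = random.Random(0)
dim = 4

def init(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical 6-node ring graph with random node features.
adjacency = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
node_features = {n: [rng.uniform(-1, 1) for _ in range(dim)] for n in adjacency}

context_weights = init(dim, dim)
target_weights = [row[:] for row in context_weights]  # target encoder starts as a copy
predictor_weights = init(dim, dim)

context_nodes, masked_nodes = [0, 1, 2, 3], [4, 5]

z_ctx = encode(context_weights, context_nodes, adjacency, node_features)
prediction = linear(predictor_weights, z_ctx)                          # predictor output
target = encode(target_weights, masked_nodes, adjacency, node_features)  # treated as a constant
```

In a real system the encoders would be graph neural networks and only the context encoder and predictor would receive gradients; the sketch only shows which component sees which part of the graph.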

The masking unit is the subgraph, so the model must infer the representation of missing structural neighbourhoods rather than copy local node features. This adapts I-JEPA-style block prediction from a regular pixel grid to non-Euclidean graph topology, with no negatives and no augmentations required.
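As a rough illustration of subgraph-level masking, the sketch below hides a connected neighbourhood rather than scattered nodes. The sampling rule (a breadth-first ball of `num_hops` around a random seed node) is an assumption chosen for simplicity and is not necessarily how Graph-JEPA selects subgraphs; the function names and the ring graph are hypothetical.

```python
import random
from collections import deque

def sample_target_subgraph(adjacency, num_hops, rng):
    """Pick a random seed node and return every node within num_hops of it."""
    seed_node = rng.choice(sorted(adjacency))
    visited = {seed_node}
    frontier = deque([(seed_node, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == num_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return visited

def split_context_target(adjacency, num_hops=1, seed=0):
    """Mask one connected subgraph; everything else is the visible context."""
    rng = random.Random(seed)
    target = sample_target_subgraph(adjacency, num_hops, rng)
    context = set(adjacency) - target
    return context, target

# Hypothetical 8-node ring: each node is linked to its two neighbours.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
context_nodes, target_nodes = split_context_target(ring, num_hops=1)
```

Because the masked region is a whole neighbourhood, no masked node has all of its neighbours visible in the context, which is what forces prediction of structure rather than interpolation of nearby features.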

The objective

The loss is the latent distance between predicted and target subgraph embeddings:

$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}) - \operatorname{sg}[\bar f_\theta(\text{subgraph})]\,\big\rVert_2^2,$$

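where $z_{\text{ctx}} = f_\theta(\text{context})$ is the context encoder's embedding of the visible part of the graph, $g_\phi$ is the predictor, $\operatorname{sg}$ is stop-gradient, and $\bar f_\theta$ is the target encoder applied to the masked subgraph. As in the image case, the asymmetric target encoder, together with the predictor and subgraph masking, provides the learning signal without contrastive negatives or augmentations.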
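A minimal numeric sketch of this objective follows, using plain Python lists in place of tensors. It assumes the squared L2 form written above and an exponential-moving-average (EMA) update for the target encoder; the momentum value and the function names `squared_l2`, `grad_wrt_prediction`, and `ema_update` are illustrative assumptions, not values or APIs taken from the paper.

```python
def squared_l2(prediction, target):
    """Squared Euclidean distance between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))

def grad_wrt_prediction(prediction, target):
    """Gradient of the loss with respect to the prediction only.
    The target is treated as a constant, which is what stop-gradient means."""
    return [2.0 * (p - t) for p, t in zip(prediction, target)]

def ema_update(target_params, context_params, momentum=0.99):
    """Move target-encoder parameters slowly towards the context encoder's.
    The momentum value is an assumed hyperparameter."""
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]

prediction = [0.5, -1.0, 2.0]   # g_phi(z_ctx), hypothetical values
target = [0.0, -1.0, 1.0]       # sg[target encoder output], hypothetical values

loss = squared_l2(prediction, target)          # 0.25 + 0.0 + 1.0 = 1.25
grad = grad_wrt_prediction(prediction, target)

# After a gradient step on the context encoder, the target encoder follows by EMA.
new_target_params = ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9)
```

No gradient is ever computed with respect to `target`: the target encoder changes only through `ema_update`, which is the asymmetry the paragraph above refers to.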

Key results & what's novel

The key contribution is to show that graph-level joint-embedding prediction is a viable SSL objective: predicting masked-subgraph embeddings produces transferable whole-graph representations without negatives or augmentations. This matters because it removes the awkward dependence on graph augmentations whose invariances are ill-defined. As a foundational graph-SSL method, Graph-JEPA establishes that latent subgraph prediction yields useful whole-graph representations, and it serves as the basis that the molecular and polymer JEPAs build on.

Strengths & limitations

+ No contrastive negatives and no hand-crafted graph augmentations.
+ Whole-graph representations that transfer to downstream tasks.
+ Cleanly adapts the I-JEPA block-prediction idea to non-Euclidean topology.
− Performance depends on how subgraphs are sampled and masked.
− Predicting an expected target embedding can wash out fine structural detail; it is not generative.
− Learns a static graph representation, with no notion of dynamics.

Connections & references

Builds on: I-JEPA
