At a glance
Problem: Sequential recommenders built on item-ID embeddings generalize poorly to cold-start items and transfer weakly across domains, since IDs carry no semantic content.
Key idea: Represent items by language-model encodings of their text, then predict masked future items in that semantic embedding space using a JEPA objective.
Modality: Item interaction sequences (language-grounded)
Target / masking: Masked items in a user history whose EMA-encoded language embeddings are predicted from the visible context.
Builds on: I-JEPA's masked latent prediction, applied to behavior sequences.
Used for: Sequential recommendation, especially cold-start and cross-domain settings.

Motivation

Sequential recommendation predicts a user's next interactions from their history. Most models encode items as learned ID embeddings, which carry no inherent semantics: a never-before-seen (cold-start) item has no meaningful embedding, and ID embeddings learned in one domain do not carry over to another. Rich item text, such as titles and descriptions, is ignored. JEPA4Rec aims to ground item representations in language so that semantics, rather than arbitrary IDs, drive prediction, enabling generalization to new items and domains while still capturing sequential behavior patterns.

How it works

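[Figure: JEPA schematic. An item sequence (items, masked items) feeds a context encoder $f_\theta$ and an EMA target encoder $\bar f_\theta$; the predictor $g_\phi$ maps $z_{\text{ctx}}$ to a prediction $\hat z$, trained with the latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$.]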
Canonical JEPA schematic for item sequences. The input is split into a visible context and hidden targets (item-level, masked items). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimises the latent distance between the two.

JEPA4Rec encodes each item from its textual description using a language model, then applies the joint-embedding predictive recipe over interaction sequences.

  • A context encoder processes a masked user history of language-grounded item embeddings.
  • A predictor estimates the latent representations of masked target items.
  • An EMA target encoder supplies the stop-gradient target item embeddings.

The model learns to anticipate future items in a semantic language-representation space rather than predicting discrete IDs or reconstructing raw text. Because supervision is in embedding space, there is no negative sampling, and the language grounding means the same representation transfers to items and domains unseen during training.
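The data flow described above can be illustrated with a toy sketch in plain Python. Everything named here is a hypothetical stand-in rather than the method's actual components: `embed_text` is a bag-of-words substitute for the language-model encoder, `predict` is a mean-pooling substitute for the context encoder and predictor, and the item texts are invented. It only shows why candidates without any ID embedding can still be scored from their text.

```python
import math
from collections import Counter

def embed_text(text):
    """Hypothetical stand-in for a language-model encoder: L2-normalised bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def split_history(history, masked_positions):
    """Split a user history into visible context items and masked target items."""
    masked = set(masked_positions)
    context = [item for i, item in enumerate(history) if i not in masked]
    targets = [item for i, item in enumerate(history) if i in masked]
    return context, targets

def predict(context_vecs):
    """Toy predictor (assumption): mean of the visible context embeddings."""
    pooled = Counter()
    for vec in context_vecs:
        for tok, w in vec.items():
            pooled[tok] += w / len(context_vecs)
    return dict(pooled)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented item texts; the last history item is masked as the prediction target.
history = [
    "wireless bluetooth headphones",
    "noise cancelling headphones",
    "headphones carrying case",
]
context, targets = split_history(history, masked_positions=[2])
z_hat = predict([embed_text(t) for t in context])

# Neither candidate has an ID embedding; both are scored from their text alone.
candidates = ["foldable headphones stand", "cast iron skillet"]
scores = {c: cosine(z_hat, embed_text(c)) for c in candidates}
best = max(scores, key=scores.get)
```

In a real system the sparse dictionaries would be dense language-model embeddings and the predictor would be a trained sequence model; the scoring step, comparing a predicted embedding with text-derived candidate embeddings, is the part this sketch is meant to convey.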

The objective

Training minimizes the latent distance between predicted and target item embeddings over masked positions:

$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}, m) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$

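where $z_{\text{ctx}}$ encodes the visible history, $m$ identifies the masked positions to be predicted, $z_{\text{tgt}}$ is the EMA-encoded representation of a masked item, and $\operatorname{sg}$ is stop-gradient. The target encoder's parameters $\bar\theta$ track the context encoder's parameters $\theta$ through the update $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, with momentum $\tau \in [0, 1)$. Predicting in the language-grounded embedding space ties the recommendation signal to item semantics rather than to memorized IDs.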
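As a minimal sketch of the two formulas above, the following plain-Python functions compute the latent loss and the EMA update on small lists of floats. The function names, the averaging over masked positions, and the default `tau=0.99` are assumptions made for illustration; a real implementation would use a tensor library with automatic differentiation, where stop-gradient is an explicit operation rather than a convention.

```python
def latent_loss(predicted, targets):
    """Squared L2 distance, averaged over masked positions (averaging is an assumption)."""
    total = 0.0
    for z_hat, z_tgt in zip(predicted, targets):
        total += sum((p - t) ** 2 for p, t in zip(z_hat, z_tgt))
    return total / len(predicted)

def loss_grad_wrt_prediction(z_hat, z_tgt):
    """Gradient of ||z_hat - z_tgt||^2 for one position, taken w.r.t. z_hat only.

    z_tgt is treated as a constant, which is what stop-gradient means here:
    no gradient flows back into the target encoder.
    """
    return [2.0 * (p - t) for p, t in zip(z_hat, z_tgt)]

def ema_update(target_params, online_params, tau=0.99):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta, applied elementwise."""
    return [tau * tb + (1.0 - tau) * th for tb, th in zip(target_params, online_params)]

# Two masked positions with 2-dimensional toy embeddings.
predicted = [[1.0, 2.0], [0.0, 0.0]]
targets = [[0.0, 0.0], [0.0, 1.0]]
loss = latent_loss(predicted, targets)  # (5 + 1) / 2
grad = loss_grad_wrt_prediction(predicted[0], targets[0])
theta_bar = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.9)
```

With $\tau$ close to 1 the target encoder changes slowly, so the regression targets stay stable while the context encoder and predictor are being trained.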

Key results & what's novel

JEPA4Rec extends joint-embedding prediction to recommender systems, bridging language-model item encoders with sequential modeling. Its novelty is learning language-grounded item representations and predicting masked items in embedding space, which yields transferable semantics that support cold-start and cross-domain recommendation without negative sampling. It shows that JEPA's masked latent-prediction recipe applies beyond perception, to user-behavior sequences, and produces semantically rich, generalizable item embeddings that improve recommendation under sparse and cold-start conditions.

Strengths & limitations

  • + Language-grounded items generalize to cold-start and cross-domain settings.
  • + Embedding-space prediction avoids negative sampling.
  • + Captures sequential structure while leveraging item text.
  • − Depends on a language encoder and the availability of item text.
  • − Text encoding adds computational cost over plain ID lookup.
  • − Predicting expected embeddings may underrepresent diverse plausible next items.
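
The last limitation follows from the squared-error objective itself. The short worked case below uses a standard property of squared loss and is not a result reported for JEPA4Rec. For a fixed context, the prediction that minimizes the expected loss is the conditional mean of the target embedding:

$$\hat z^\star = \arg\min_{\hat z}\; \mathbb{E}\big[\lVert \hat z - z_{\text{tgt}} \rVert_2^2 \,\big|\, z_{\text{ctx}}\big] = \mathbb{E}\big[z_{\text{tgt}} \,\big|\, z_{\text{ctx}}\big]$$

Suppose, as an illustrative assumption, that two next items $a$ and $b$ are equally likely given the context, with embeddings $z_a$ and $z_b$. Then

$$\hat z^\star = \tfrac{1}{2}(z_a + z_b), \qquad \mathbb{E}\big[\lVert \hat z^\star - z_{\text{tgt}} \rVert_2^2\big] = \tfrac{1}{4}\,\lVert z_a - z_b \rVert_2^2$$

so the optimal prediction sits at the midpoint between the two items and may match neither closely when they are semantically far apart.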

Connections & references

Builds on: I-JEPA