Motivation
Sequential recommendation predicts a user's next interactions from their history. Most models encode items as learned ID embeddings, which carry no inherent semantics: a never-before-seen (cold-start) item has no meaningful embedding, and IDs learned in one domain are useless in another. Rich item text, such as titles and descriptions, is ignored. JEPA4Rec aims to ground item representations in language so that semantics, rather than arbitrary IDs, drive prediction, enabling generalization to new items and domains while still capturing sequential behavior patterns.
How it works
JEPA4Rec encodes each item from its textual description using a language model, then applies the Joint-Embedding Predictive Architecture (JEPA) recipe over interaction sequences.
- A context encoder processes a masked user history of language-grounded item embeddings.
- A predictor estimates the latent representations of masked target items.
- An exponential-moving-average (EMA) target encoder supplies the target item embeddings, which are held fixed by a stop-gradient.
The model learns to anticipate future items in a semantic language-representation space rather than predicting discrete IDs or reconstructing raw text. Because supervision is in embedding space, there is no negative sampling, and the language grounding means the same representation transfers to items and domains unseen during training.
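The data flow described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `embed_text` is a hashed bag-of-words stand-in for the language-model item encoder, `context_encoder` is simple mean pooling, and `predictor` is an identity placeholder. All function names, the width `DIM`, and the example titles are hypothetical. In the method itself the targets would come from the EMA target encoder; here the raw text embeddings play that role.

```python
import hashlib
import math

DIM = 16  # toy embedding width, chosen only for illustration (assumption)


def embed_text(text):
    """Toy stand-in for a language-model item encoder: hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = hashlib.sha256(token.encode("utf-8")).digest()[0] % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def split_history(item_embs, masked_positions):
    """Separate visible (context) items from masked (target) items."""
    visible = [e for i, e in enumerate(item_embs) if i not in masked_positions]
    targets = [e for i, e in enumerate(item_embs) if i in masked_positions]
    return visible, targets


def context_encoder(visible):
    """Placeholder context encoder: mean-pools the visible item embeddings."""
    return [sum(col) / len(visible) for col in zip(*visible)]


def predictor(z_ctx, position):
    """Placeholder predictor: returns z_ctx unchanged and ignores the position.
    A learned predictor would condition on the masked position."""
    return list(z_ctx)


# Hypothetical user history, given as item titles.
history = [
    "wireless noise cancelling headphones",
    "bluetooth portable speaker",
    "usb-c charging cable",
    "over-ear studio headphones",
]
item_embs = [embed_text(title) for title in history]

masked = {3}  # hide the last interaction
visible, targets = split_history(item_embs, masked)
z_ctx = context_encoder(visible)
z_pred = predictor(z_ctx, position=3)  # compared against targets[0] in training

# An item never seen before still gets an embedding, because only text is needed.
cold_start_emb = embed_text("brand new smart watch")
```

Because `embed_text` needs only text, a new item receives an embedding without any retraining, which is the property the cold-start argument relies on.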
The objective
Training minimizes the latent distance between predicted and target item embeddings over masked positions:
$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{ctx}}, m) - \operatorname{sg}(z_{\text{tgt}})\,\big\rVert_2^2$$
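where $g_\phi$ is the predictor, $z_{\text{ctx}}$ encodes the visible history, $m$ denotes the masked position being predicted, $z_{\text{tgt}}$ is the target-encoder representation of the masked item, and $\operatorname{sg}$ is stop-gradient. The target encoder's parameters $\bar\theta$ are not trained by gradient descent; they track the context encoder's parameters $\theta$ through the EMA update $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$, where $\tau$ is the decay rate. Predicting in the language-grounded embedding space ties the recommendation signal to item semantics rather than to memorized IDs.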
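The two formulas above, the squared-L2 loss and the EMA update, are small enough to write out directly. The following is a minimal sketch on plain Python lists; the function names `latent_loss` and `ema_update` and the numeric values are assumptions made for illustration, and a real implementation would use an autodiff framework in which the target is detached from the gradient graph.

```python
def latent_loss(predicted, target):
    """Squared L2 distance between a predicted and a target embedding.
    The target is treated as a constant (the role of stop-gradient)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def ema_update(target_params, online_params, tau):
    """Element-wise tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


# Toy embeddings (illustrative values only).
pred = [0.5, 0.0, 1.0]
tgt = [1.0, 0.0, 0.0]
loss = latent_loss(pred, tgt)  # 0.25 + 0.0 + 1.0 = 1.25

# Toy parameters: the target encoder moves a fraction (1 - tau) toward
# the context encoder on each step.
theta = [1.0, 2.0]      # context-encoder parameters
theta_bar = [0.0, 0.0]  # target-encoder parameters
theta_bar = ema_update(theta_bar, theta, tau=0.9)  # approximately [0.1, 0.2]
```

With $\tau$ close to 1 the target encoder changes slowly, so the regression targets stay relatively stable from one step to the next.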
Key results & what's novel
JEPA4Rec extends joint-embedding prediction to recommender systems, bridging language-model item encoders with sequential modeling. Its novelty is learning language-grounded item representations and predicting masked items in embedding space, which yields transferable semantics that support cold-start and cross-domain recommendation without negative sampling. It demonstrates that JEPA's masked latent-prediction recipe applies well beyond perception, to user-behavior sequences, producing semantically rich, generalizable item embeddings that improve recommendation under sparse and cold-start conditions.
Strengths & limitations
- + Language-grounded items generalize to cold-start and cross-domain settings.
- + Embedding-space prediction avoids negative sampling.
- + Captures sequential structure while leveraging item text.
- − Depends on a language encoder and the availability of item text.
- − Text encoding adds computational cost over plain ID lookup.
- − Predicting expected embeddings may underrepresent diverse plausible next items.
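The last limitation follows from a general property of squared-error regression: when several next items are equally plausible, the prediction that minimizes the expected loss is the mean of their embeddings. The toy example below illustrates this with two hypothetical 2-D target embeddings; it is not a measured result about JEPA4Rec, and `expected_loss` is a name introduced here.

```python
def expected_loss(pred, targets):
    """Average squared L2 distance from pred to each equally likely target."""
    total = 0.0
    for tgt in targets:
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
    return total / len(targets)


# Two equally plausible next items with orthogonal embeddings (toy values).
targets = [[1.0, 0.0], [0.0, 1.0]]

# The mean of the targets, which sits between the two items.
midpoint = [sum(col) / len(targets) for col in zip(*targets)]  # [0.5, 0.5]

loss_midpoint = expected_loss(midpoint, targets)   # 0.5
loss_commit = expected_loss(targets[0], targets)   # 1.0
```

The midpoint achieves the lower expected loss even though it matches neither item, which is why a point prediction in embedding space can blur distinct user intents.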