At a glance
Problem: Protein language models are dominated by masked language modelling (MLM); it is unclear whether abstract latent prediction can capture evolutionary and structural constraints better than per-residue token prediction.
Key idea: Predict EMA target-encoder embeddings at masked residue positions, but only as a complement to MLM. JEPA-only collapses; MLM + masked-position JEPA wins.
Modality: Protein sequence
Target / masking: Masked residue positions (and the spans/domains they cover); an EMA target encoder supplies latent targets.
Builds on: I-JEPA's masked latent-prediction recipe; MLM protein language models.
Used for: Stability, variant effect, remote homology, enzyme classification, fold retrieval.

Motivation

For proteins, the dominant self-supervised objective is masked language modelling (MLM): mask residues and predict their identity. The JEPA principle suggests an alternative — predict abstract latent targets instead of token identities — which in vision and elsewhere captures higher-level structure. ProteinJEPA (2026) asks whether that transfers to protein sequences, where the predictable signal is evolutionary and structural constraint. The motivating question is whether latent prediction can replace, or merely augment, token-level reconstruction for capturing these constraints.

How it works

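[Figure: JEPA schematic. Blocks: protein sequence (residue/domain level, masked positions); context encoder $f_\theta$ producing $z_{\text{ctx}}$; target encoder $\bar f_\theta$ (EMA copy) producing $\bar z$ under stop-gradient; predictor $g_\phi$; latent loss $\lVert \hat z - \operatorname{sg}(\bar z) \rVert^2$; local loss (e.g. MLM).]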
Canonical JEPA schematic for protein sequences. The input is split into a visible context and hidden targets (residue/domain-level, masked positions). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context to the target embeddings; training minimises the latent distance. A local/generative loss runs alongside latent prediction (hybrid objective).

A protein sequence is encoded by a context encoder. A predictor then predicts, at masked residue positions, the embeddings produced by an EMA target encoder, using a latent prediction loss with stop-gradient.

  • The masking unit is residue positions and, by extension, the spans/domains they cover.
  • Crucially the design is hybrid: the masked-position JEPA term is trained alongside the standard MLM objective.
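
The two moving parts described above, choosing masked residue positions and keeping the target encoder as an EMA of the context encoder, can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names `sample_mask` and `ema_update`, the 15% mask ratio, and the decay value are all assumptions chosen for the example, and real parameters would be tensors rather than lists.

```python
import random

def sample_mask(seq_len, mask_ratio=0.15, seed=0):
    """Pick a sorted list of residue positions to hide.

    The mask ratio is an assumed value; the source does not state one.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(seq_len * mask_ratio))
    return sorted(rng.sample(range(seq_len), n_mask))

def ema_update(target_params, online_params, decay=0.99):
    """Return new target weights: t <- decay * t + (1 - decay) * o.

    No gradient ever flows into the target encoder; it only tracks
    the context (online) encoder through this moving average.
    """
    return {
        name: [decay * t + (1.0 - decay) * o
               for t, o in zip(target_params[name], online_params[name])]
        for name in target_params
    }

# Toy usage with hypothetical two-element weight vectors.
online = {"w": [1.0, 2.0]}
target = {"w": [0.0, 0.0]}
target = ema_update(target, online, decay=0.9)
masked = sample_mask(seq_len=20, mask_ratio=0.15)
```

In a training loop, `ema_update` would typically run once after each optimiser step on the context encoder, so the targets change more slowly than the network that predicts them.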

The decisive empirical lesson is that the latent term cannot stand alone: JEPA-only collapses in nearly every experiment. The MLM term acts as a grounding, anti-collapse anchor, while the masked-position latent term enriches and regularises the representation.
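
The source does not say how collapse was measured. One common diagnostic, shown here purely as an assumed example, is the average per-dimension standard deviation of a batch of embeddings: if every input maps to (nearly) the same vector, that value drops towards zero. The helper name `mean_feature_std` and the toy data are hypothetical.

```python
from statistics import pstdev

def mean_feature_std(embeddings):
    """Average per-dimension standard deviation over a batch of embeddings.

    Values near zero indicate that all inputs share almost the same
    representation, i.e. the encoder has collapsed.
    """
    dims = list(zip(*embeddings))  # transpose: one tuple per feature dimension
    return sum(pstdev(d) for d in dims) / len(dims)

# Four identical vectors: a fully collapsed batch.
collapsed = [[0.5, -1.0, 2.0]] * 4
# Four distinct vectors: features vary across the batch.
healthy = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]]

collapsed_score = mean_feature_std(collapsed)
healthy_score = mean_feature_std(healthy)
```

Tracking such a score during training would be one way to compare a JEPA-only run against the hybrid run, under the assumption that collapse shows up as vanishing feature variance.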

The objective

The training loss adds a masked-position latent term to MLM:

$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \lambda \sum_{k\in\text{mask}} \big\lVert\, g_\phi(z_{\text{ctx}}, m_k) - \operatorname{sg}[\bar f_\theta(x)_k]\,\big\rVert_2^2,$$

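with $\operatorname{sg}$ the stop-gradient, $\bar f_\theta$ an EMA copy of the context encoder $f_\theta$, $g_\phi$ the predictor, $z_{\text{ctx}}$ the context-encoder output, $m_k$ the predictor's query for masked position $k$, and $\lambda$ the weight on the latent term. Removing $\mathcal{L}_{\text{MLM}}$ (the JEPA-only ablation) leads to representational collapse, so the reconstruction term is not optional here: it is what makes the latent term useful.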
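
To make the objective concrete, here is a minimal plain-Python sketch that evaluates the loss for given predictions and targets. It is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, the MLM term is taken to be mean cross-entropy over masked positions (the source does not specify its normalisation), and the stop-gradient is implicit because the targets are passed in as constants.

```python
import math

def mlm_loss(logits, labels, masked):
    """Mean cross-entropy over masked positions (normalisation is an assumption).

    logits[k] is a list of per-token scores; labels[k] is the true token index.
    """
    total = 0.0
    for k in masked:
        m = max(logits[k])  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in logits[k]))
        total += log_z - logits[k][labels[k]]
    return total / len(masked)

def latent_loss(pred, target, masked):
    """Sum over masked positions of the squared L2 distance.

    `target` holds EMA target-encoder embeddings, treated as constants
    (this plays the role of the stop-gradient).
    """
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred[k], target[k]))
        for k in masked
    )

def hybrid_loss(logits, labels, pred, target, masked, lam=1.0):
    """L = L_MLM + lam * latent term, following the equation above."""
    return mlm_loss(logits, labels, masked) + lam * latent_loss(pred, target, masked)

# Toy example: two positions, two-token vocabulary, 2-d embeddings.
logits = [[0.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
pred = [[1.0, 0.0], [0.0, 0.0]]
target = [[0.0, 0.0], [0.0, 2.0]]
masked = [0, 1]
loss = hybrid_loss(logits, labels, pred, target, masked, lam=0.5)
```

Setting `lam=0.0` recovers the MLM-only baseline, while dropping the `mlm_loss` term corresponds to the JEPA-only ablation that the text reports as collapsing.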

Key results & what's novel

The headline is a negative-then-positive result. JEPA-only collapses across nearly every experiment, but the hybrid MLM + masked-position JEPA objective beats MLM-only under matched compute. The hybrid improves stability prediction, variant effect prediction, remote homology detection, enzyme classification, and fold retrieval. The novelty is the explicit prescription: in the protein domain latent prediction is a complement that regularises and enriches a token-level objective, not a standalone replacement — a concrete instance of the broader biology lesson that latent objectives should be paired with a grounding signal.

Strengths & limitations

  • + Hybrid objective beats MLM-only under matched compute across five protein tasks.
  • + Clear, honest characterisation of when the latent term helps and when it fails.
  • − JEPA-only collapses: the method depends on an MLM anchor and cannot drop reconstruction.
  • − Introduces an EMA encoder, predictor, and loss-weight $\lambda$ to tune.
  • − Gains are incremental complements to MLM rather than a new standalone paradigm for proteins.

Connections & references

Builds on: I-JEPA