JEPA-DNA — World Modeling

At a glance

Problem: Genomic foundation models trained purely with token-level MLM or next-token prediction optimise local nucleotide statistics and underrepresent global, long-range sequence structure.

Key idea: A model-agnostic continual-training framework that grounds an existing genomic backbone with a JEPA latent objective over global sequence embeddings, run alongside the token-level loss.

Modality: DNA sequence

Target / masking: Genomic spans / sequence segments; an EMA target encoder embeds a target view to supply latent targets.

Builds on: I-JEPA's latent-prediction recipe; token-level MLM/NTP genomic language models.

Used for: Linear-probe and zero-shot genomic prediction across many benchmark tasks.

Motivation

Genomic foundation models are usually trained with masked language modelling (MLM) or next-token prediction (NTP). Both objectives optimise local nucleotide statistics — what base is likely given its neighbours — and can therefore underrepresent the global, long-range regulatory structure that governs gene function. Retraining a large genomic backbone from scratch to fix this is expensive. JEPA-DNA (Larey et al., 2026) asks instead whether a global, higher-order signal can be injected into an existing model without changing its architecture, by adding a latent-prediction objective on top of the familiar token-level loss.

How it works

Canonical JEPA schematic for DNA sequence. The input is split into a visible context and hidden targets (masked spans of the sequence). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy with gradients stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to predicted target embeddings; training minimises the distance between predicted and target embeddings in latent space. A local/generative token-level loss runs alongside latent prediction, giving a hybrid objective.

JEPA-DNA is a model-agnostic, continual-training wrapper around a pretrained genomic model.

A context encoder embeds a visible view of the sequence; a predictor maps that context to the latent representation of a target view.
An EMA target encoder embeds the target view to supply stop-gradient targets, defined over global sequence embeddings rather than per-token outputs.
The original token-level MLM/NTP loss is retained and optimised at the same time.
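
The EMA target encoder in the steps above can be sketched as a simple parameter average. This is a minimal NumPy illustration, not the authors' implementation: the function name `ema_update`, the dictionary-of-arrays parameter layout, and the decay values are all assumptions chosen for the example.

```python
import numpy as np

def ema_update(target_params, online_params, decay=0.99):
    """Return new target parameters as an exponential moving average.

    target <- decay * target + (1 - decay) * online
    The update is a plain weighted copy; no gradient flows through it.
    The default decay is an illustrative assumption, not a reported value.
    """
    return {
        name: decay * target_params[name] + (1.0 - decay) * online_params[name]
        for name in target_params
    }

# Toy parameters standing in for encoder weights (hypothetical shapes).
online = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

# With decay=0.9, each target entry moves 10% of the way toward the online value.
target = ema_update(target, online, decay=0.9)
```

Because the target encoder only ever changes through this average, it lags the context encoder and supplies slowly moving targets for the predictor.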

The masking unit is genomic spans — regulatory regions, motifs, or sequence segments. The latent objective grounds the backbone in higher-order organisation while the token loss keeps it anchored to nucleotide identity, making this a hybrid generative-plus-latent scheme.
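
To make the span-level masking concrete, the sketch below hides a few contiguous spans of a toy DNA string. It assumes uniformly random span positions and a fixed span length; the source does not specify how spans are chosen, and a real setup might instead use annotated regulatory regions or motifs. The helper `sample_span_mask` and the placeholder symbol `N` are hypothetical.

```python
import numpy as np

def sample_span_mask(seq_len, num_spans, span_len, rng):
    """Return a boolean mask; True marks hidden (target) positions.

    Span starts are drawn uniformly at random, so spans may overlap.
    """
    mask = np.zeros(seq_len, dtype=bool)
    starts = rng.integers(0, seq_len - span_len + 1, size=num_spans)
    for start in starts:
        mask[start:start + span_len] = True
    return mask

rng = np.random.default_rng(0)
tokens = np.array(list("ACGTACGTACGTACGTACGT"))

mask = sample_span_mask(len(tokens), num_spans=2, span_len=4, rng=rng)

# Context view: hidden positions are replaced by a placeholder symbol.
context = np.where(mask, "N", tokens)
# Target view: the bases the predictor must account for in latent space.
target_bases = tokens[mask]
```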

The objective

The total loss combines the existing token-level term with a JEPA latent term over global embeddings:

$$\mathcal{L} = \mathcal{L}_{\text{MLM/NTP}} + \lambda\,\big\lVert\, g_\phi(z_{\text{ctx}}) - \operatorname{sg}[\bar f_\theta(x_{\text{tgt}})]\,\big\rVert_2^2,$$

where $\operatorname{sg}$ is stop-gradient, $\bar f_\theta$ is the EMA target encoder, $z_{\text{ctx}}$ is the context encoder's embedding of the visible view, $g_\phi$ is the predictor, and $\lambda$ weights the latent term against the token loss. The token term anchors the model and prevents the latent objective from collapsing; the latent term grounds the global representation. Because the framework is continual and model-agnostic, it upgrades incumbent backbones in place.
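
As a worked instance of the loss above, the following NumPy sketch evaluates both terms on toy inputs. It is an illustration under stated assumptions: the token term is written as MLM-style cross-entropy over masked positions, the function names are hypothetical, and the value of `lam` is arbitrary. NumPy has no autodiff, so the target embedding is simply a constant here; in a training framework it would be wrapped in a stop-gradient operation.

```python
import numpy as np

def token_cross_entropy(logits, labels, mask):
    """Mean cross-entropy over masked positions only (MLM-style)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return nll[mask].mean()

def latent_loss(pred, target):
    """Squared L2 distance between predicted and target embeddings."""
    return np.sum((pred - target) ** 2)

def hybrid_loss(logits, labels, mask, pred, target, lam=1.0):
    """Token-level term plus lam times the latent term."""
    return token_cross_entropy(logits, labels, mask) + lam * latent_loss(pred, target)

# Toy inputs: 3 positions, 4 nucleotide classes, uniform logits.
logits = np.zeros((3, 4))
labels = np.array([0, 1, 2])
mask = np.array([True, False, True])

# Toy global embeddings: predictor output vs. EMA target output.
pred = np.array([1.0, 0.0])
target = np.array([0.0, 0.0])

# Uniform logits give a token loss of log(4); the latent distance is 1.
total = hybrid_loss(logits, labels, mask, pred, target, lam=0.5)
```

Here the total is $\log 4 + 0.5 \cdot 1$, which shows how $\lambda$ trades the latent term off against the token term.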

Key results & what's novel

JEPA-DNA reports consistent improvements across 17 genomic benchmark tasks under both linear-probe and zero-shot evaluation. Because gains appear in linear probing and zero-shot — settings that read off representation quality directly — the result indicates better-structured, more transferable embeddings rather than merely a better-tuned head. The conceptual novelty is the recipe: rather than choosing between a token-level objective and a latent one, it shows the two can be combined as a continual, architecture-agnostic upgrade that grounds a genomic model in global structure without retraining from scratch.
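
The reason a linear probe reads off representation quality is that the backbone stays frozen and only a linear map is fitted on its embeddings. The sketch below shows that protocol on synthetic data. Everything in it is an assumption for illustration: the embeddings are random clusters rather than model outputs, the probe is a ridge-regularised least-squares classifier, and the numbers say nothing about JEPA-DNA's reported results.

```python
import numpy as np

def add_bias(embeddings):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([embeddings, np.ones((len(embeddings), 1))])

def fit_linear_probe(embeddings, labels, num_classes, ridge=1e-3):
    """Fit a ridge-regularised linear map from frozen embeddings to one-hot labels."""
    X = add_bias(embeddings)
    Y = np.eye(num_classes)[labels]
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

def probe_predict(weights, embeddings):
    """Predict the class with the highest linear score."""
    return (add_bias(embeddings) @ weights).argmax(axis=1)

# Synthetic stand-in for frozen backbone embeddings: two separated clusters.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])
embeddings = centers[labels] + 0.3 * rng.standard_normal((100, 2))

# Fit on even indices, evaluate on odd indices (both classes appear in each split).
train_idx = np.arange(0, 100, 2)
test_idx = np.arange(1, 100, 2)

weights = fit_linear_probe(embeddings[train_idx], labels[train_idx], num_classes=2)
accuracy = (probe_predict(weights, embeddings[test_idx]) == labels[test_idx]).mean()
```

Since the probe has no capacity beyond a linear map, any accuracy difference between two backbones under this protocol comes from how linearly separable their embeddings already are.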

Strengths & limitations

+ Model-agnostic and continual — upgrades existing backbones in place, no architecture changes.
+ Consistent gains over 17 tasks in linear-probe and zero-shot settings.
+ The token loss prevents the latent objective from collapsing.
− Adds a second objective and an EMA encoder, which increases training cost and introduces a loss-weight hyperparameter that must be balanced.
− Defining informative global target views and span masks is non-trivial.
− Gains are characterised on benchmark suites; behaviour on very long-range genomic reasoning is still to be probed.

Connections & references

Builds on: I-JEPA

Related: Cell-JEPA, ProteinJEPA, Graph-JEPA

Paper ↗