Motivation
Representation learning tells you where a cell is in latent space; it does not tell you where it will move under a knockout or a compound. Drug-discovery decisions are counterfactual — target validation, hit and lead selection, combination and dose choices all ask how state changes after an intervention. BioJEPA-AC takes the action-conditioned world-model idea proven on video and robotics (V-JEPA 2-AC) and points it at the cell, treating an intervention as the action.
How it works
The model couples a state encoder with an action-conditioned predictor:
- A state encoder $f_\theta$ maps the pre-perturbation profile $x_{\text{pre}}$ to a latent $z_{\text{pre}}$ (the Cell-JEPA role).
- An action encoder represents the intervention $a$ — a gene knockout/knockdown, or a compound with dose and time.
- A predictor $g_\phi$ maps $(z_{\text{pre}}, a)$ to the predicted post-state latent $\hat z_{\text{post}}$.
- An EMA target encoder embeds the actual measured post-state $x_{\text{post}}$ to supervise the prediction.
Once trained, the model serves as a simulator: encode a starting state, apply each candidate intervention, and search for the one whose predicted latent lands in a desired region (e.g. closer to a healthy manifold and farther from a disease manifold).
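To make the data flow concrete, here is a minimal NumPy sketch of the four components and one forward pass. It is an illustration under stated assumptions, not the project's implementation: the linear maps, the `tanh` non-linearity, the toy sizes, the EMA rate `tau`, and all function names are hypothetical, and actions are reduced to integer ids (dose and time are omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
N_GENES, D_LATENT, N_ACTIONS = 50, 8, 5  # toy sizes, chosen for illustration

# Hypothetical parameters: small linear maps standing in for real networks.
W_state = rng.normal(scale=0.1, size=(N_GENES, D_LATENT))      # state encoder f_theta
E_action = rng.normal(scale=0.1, size=(N_ACTIONS, D_LATENT))   # action embeddings
W_pred = rng.normal(scale=0.1, size=(2 * D_LATENT, D_LATENT))  # predictor g_phi
W_target = W_state.copy()                                      # EMA copy of f_theta

def encode_state(x, W):
    """Map expression profiles (batch, genes) to latents (batch, d)."""
    return np.tanh(x @ W)

def encode_action(action_ids):
    """Look up one embedding per intervention id (assumed discrete actions)."""
    return E_action[action_ids]

def predict_post(z_pre, a_emb):
    """Predict the post-perturbation latent from the pair (z_pre, a)."""
    return np.concatenate([z_pre, a_emb], axis=-1) @ W_pred

def ema_update(W_tgt, W_online, tau=0.99):
    """Move the target weights a small step toward the online encoder."""
    return tau * W_tgt + (1.0 - tau) * W_online

# Random stand-ins for paired pre/post profiles of four cells.
x_pre = rng.normal(size=(4, N_GENES))
x_post = rng.normal(size=(4, N_GENES))
action_ids = np.array([0, 1, 2, 3])

z_pre = encode_state(x_pre, W_state)
z_hat_post = predict_post(z_pre, encode_action(action_ids))
z_post_target = encode_state(x_post, W_target)  # supervision target, treated as constant
W_target = ema_update(W_target, W_state)

# Simulator use: score every candidate action for the first cell.
z_candidates = predict_post(
    np.repeat(z_pre[:1], N_ACTIONS, axis=0),
    encode_action(np.arange(N_ACTIONS)),
)
```

In a real system the embedding table would likely be replaced by an encoder that also consumes dose and time, and the weights would be trained rather than random.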
The objective
The action-conditioned latent loss mirrors V-JEPA 2-AC, with interventions in place of robot actions:
$$\mathcal{L} = \big\lVert\, g_\phi(z_{\text{pre}}, a) - \operatorname{sg}\big[\bar f_\theta(x_{\text{post}})\big] \,\big\rVert_2^2 \;+\; \lambda\,\mathcal{R}(z),$$
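where $\operatorname{sg}[\cdot]$ denotes stop-gradient, $\bar f_\theta$ is the EMA target encoder, $\lambda$ weights the regulariser, and $\mathcal{R}$ is an anti-collapse regulariser. Planning then solves $\min_{a}\,\lVert \hat z_{\text{post}}(a) - z^* \rVert$ over candidate interventions, where $\hat z_{\text{post}}(a) = g_\phi(z_{\text{pre}}, a)$ and $z^*$ is the target state.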
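The loss and the planning step can be sketched numerically as follows. Several choices here are assumptions rather than details from the source: the regulariser is a variance hinge in the style of VICReg, it is applied to the encoder latents, `lam=0.1` is arbitrary, and planning is an exhaustive search over a finite candidate set. NumPy has no autodiff, so the stop-gradient is represented simply by treating the target latent as a constant input.

```python
import numpy as np

def variance_penalty(z, target_std=1.0, eps=1e-4):
    """Hinge on the per-dimension batch std; one possible choice of R (assumed)."""
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, target_std - std))

def ac_latent_loss(z_hat_post, z_post_target, z_pre, lam=0.1):
    """Squared L2 prediction error, averaged over the batch, plus lam * R(z)."""
    pred_err = np.sum((z_hat_post - z_post_target) ** 2, axis=-1).mean()
    return pred_err + lam * variance_penalty(z_pre)

def plan(z_hat_by_action, z_star):
    """Return the index of the candidate whose predicted latent is nearest z_star."""
    dists = np.linalg.norm(z_hat_by_action - z_star, axis=-1)
    return int(np.argmin(dists)), dists

# Toy example: three candidate interventions in a 2-D latent space.
z_hat = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
z_star = np.array([1.0, 0.9])
best, dists = plan(z_hat, z_star)
print(best)  # prints 1: the second candidate is closest to the target state
```

Exhaustive search is only practical for small candidate sets; larger or continuous action spaces (for example, dose) would need a sampling-based or gradient-based optimiser instead.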
Key results & what's novel
BioJEPA-AC is an open, in-progress project (GPTomics) rather than a finished benchmark paper, so it is best read as a concrete architecture proposal: the first explicit attempt to build a V-JEPA-2-style, action-conditioned world model of the cell. Its significance is the framing — turning self-supervised cell representations into a controllable, counterfactual simulator of perturbation response, which is precisely the missing piece between a good cell encoder (Cell-JEPA) and decision-making in the drug-discovery pipeline.
Strengths & limitations
- + Directly answers counterfactual, intervention-level questions; supports planning/search.
- + Reuses unlabelled state pretraining; interventions slot in as actions.
- − Needs paired pre/post perturbation data; evaluation must hold out unseen interventions, targets and contexts, not just random cells.
- − Causal validity is not automatic: latent prediction can capture correlation rather than causal effect, and identifiability theory (see the LeJEPA world-model analysis) warns that the required assumptions are strong for biology.
- − Early-stage; quantitative performance is not yet established.
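The evaluation point above, holding out whole interventions rather than random cells, can be illustrated with a small split helper. This is a sketch under assumptions: the function name, the `test_fraction` default, and the `KO_*` labels are hypothetical, and the same grouping idea would apply to targets or contexts by passing those labels instead.

```python
import numpy as np

def split_by_intervention(intervention_ids, test_fraction=0.25, seed=0):
    """Hold out entire interventions so no test intervention appears in training."""
    rng = np.random.default_rng(seed)
    unique = np.unique(intervention_ids)
    n_test = max(1, int(round(test_fraction * len(unique))))
    held_out = rng.choice(unique, size=n_test, replace=False)
    test_mask = np.isin(intervention_ids, held_out)
    return ~test_mask, test_mask, set(held_out.tolist())

# One label per cell: which intervention that cell received (made-up labels).
ids = np.array(["KO_A", "KO_A", "KO_B", "KO_C", "KO_C", "KO_D", "KO_B", "KO_D"])
train_mask, test_mask, held_out = split_by_intervention(ids)

# Every cell from a held-out intervention lands in the test set.
train_ids = set(ids[train_mask].tolist())
test_ids = set(ids[test_mask].tolist())
```

A random per-cell split would instead place cells from the same intervention on both sides, which measures interpolation within known interventions rather than generalisation to unseen ones.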