stable-worldmodel — World Modeling

At a glance

Problem: World-model research is fragmented across incompatible data formats, ad hoc baselines, bespoke planners, and uncontrolled evaluation, so results are hard to reproduce or compare.

Key idea: Provide a standardized research stack: shared data conversion, reference baselines, a library of planning solvers, and controllable evaluation environments.

Modality: Framework (modality-agnostic; vision + actions in practice)

Target / masking: Includes JEPA-style baselines with latent prediction and anti-collapse; not a single model.

Builds on: JEPA action-conditioned world models and the LeJEPA/LeWorldModel line of stable training.

Used for: Reproducible, apples-to-apples training, planning, and evaluation of world models.

Motivation

Progress in world models is throttled less by ideas than by infrastructure. Every paper converts trajectory data its own way, reimplements baselines from scratch, writes a bespoke planner, and evaluates in a setup no one else can replicate. The result is a literature where two methods can rarely be compared without confounds, and where a reported gain might come from the encoder, the planner, the data pipeline, or the evaluation. stable-worldmodel addresses this by supplying a shared, reproducible stack so that representation learning, dynamics modeling, and planning can each be swapped while everything else is held fixed.

What's inside

The stack bundles four layers. Data conversion utilities normalise heterogeneous trajectory datasets into a common state-action interface. Reference baselines include JEPA-style latent world models, each comprising a context encoder, an action-conditioned predictor $\hat z_{t+1}=g_\phi(z_t,a_t)$, and a latent prediction objective with an anti-collapse term, so a standard model is always available to compare against. A library of planning solvers operates over the learned latent dynamics: sampling-based model-predictive control, gradient-based optimisation, and value- or search-guided variants. Finally, controllable evaluation environments let task difficulty, observations, and dynamics be varied systematically, exposing how methods degrade as conditions change.
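To make the baseline objective concrete, here is a minimal NumPy sketch of an action-conditioned latent prediction loss with an anti-collapse term. It is not the stack's API: `predict_next`, `latent_prediction_loss`, `W_z`, `W_a`, and `var_weight` are hypothetical names, the predictor is a toy linear map standing in for $g_\phi$, and the anti-collapse term is a variance hinge in the style of VICReg, chosen only because the text above does not say which regulariser the baselines use.

```python
import numpy as np


def predict_next(z, a, W_z, W_a):
    """Toy linear stand-in for g_phi(z_t, a_t); a real predictor would be a network."""
    return z @ W_z + a @ W_a


def latent_prediction_loss(z_pred, z_target, var_weight=1.0, eps=1e-4):
    """Return (total, prediction_term, anti_collapse_term) for a batch of latents.

    z_pred, z_target: arrays of shape (batch, latent_dim).
    """
    # Prediction term: squared error between predicted and target latents,
    # summed over latent dimensions and averaged over the batch.
    prediction = np.mean(np.sum((z_pred - z_target) ** 2, axis=1))

    # Anti-collapse term (illustrative assumption): a hinge that is positive
    # whenever a latent dimension's batch standard deviation falls below 1,
    # so an encoder that maps every input to the same point is penalised.
    std = np.sqrt(z_target.var(axis=0) + eps)
    anti_collapse = np.mean(np.maximum(0.0, 1.0 - std))

    return prediction + var_weight * anti_collapse, prediction, anti_collapse
```

With perfectly predicted but constant latents, the prediction term is zero while the hinge stays close to one, which is exactly the degenerate solution an anti-collapse term is meant to rule out.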

How you use it

A typical workflow holds the pipeline fixed and varies one piece. You convert a dataset into the common format, pick or train an encoder and predictor against a JEPA baseline, attach a planning solver from the library, and evaluate in a controllable environment with chosen difficulty settings. Because the interfaces are shared, you can substitute a new encoder, a new predictor, or a new planner — for example swapping sampling-based MPC for value-guided search — without touching the rest. This makes controlled ablation the default mode of experimentation rather than an afterthought, and lets results from different groups be compared directly.
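The "hold everything fixed, swap one piece" workflow can be sketched as follows. Everything here is an illustrative assumption rather than the library's interface: the `Planner` signature, `toy_dynamics`, `random_shooting`, `cem`, and `evaluate` are hypothetical names, and the latent dynamics are a toy additive model. Both planners are sampling-based (random shooting and the cross-entropy method); a value-guided search would plug in the same way under this assumed signature.

```python
from typing import Callable

import numpy as np

# Assumed shared interfaces (hypothetical, for illustration only).
Dynamics = Callable[[np.ndarray, np.ndarray], np.ndarray]  # (z, a) -> next z
Planner = Callable[[np.ndarray, np.ndarray, Dynamics], np.ndarray]  # -> (horizon, dim)


def toy_dynamics(z, a):
    """Toy latent dynamics z' = z + a; works for single or batched inputs."""
    return z + a


def rollout_cost(z0, z_goal, actions, dynamics):
    """Squared distance to the goal after rolling out each action sequence."""
    z = np.repeat(z0[None, :], actions.shape[0], axis=0)
    for t in range(actions.shape[1]):
        z = dynamics(z, actions[:, t])
    return np.sum((z - z_goal) ** 2, axis=1)


def random_shooting(z0, z_goal, dynamics, horizon=5, n_samples=256, seed=0):
    """Sample action sequences uniformly and keep the cheapest one."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, z0.shape[0]))
    cost = rollout_cost(z0, z_goal, actions, dynamics)
    return actions[int(np.argmin(cost))]


def cem(z0, z_goal, dynamics, horizon=5, n_samples=256, n_elite=32,
        n_iters=4, seed=0):
    """Cross-entropy method: refit a Gaussian to the lowest-cost samples."""
    rng = np.random.default_rng(seed)
    dim = z0.shape[0]
    mean, std = np.zeros((horizon, dim)), np.ones((horizon, dim))
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        cost = rollout_cost(z0, z_goal, actions, dynamics)
        elite = actions[np.argsort(cost)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean


def evaluate(planner: Planner, dynamics: Dynamics, z0, z_goal):
    """Execute the planned sequence open-loop; return final distance to goal."""
    z = z0
    for a in planner(z0, z_goal, dynamics):
        z = dynamics(z, a)
    return float(np.linalg.norm(z - z_goal))


# Same dynamics, same task, same evaluation; only the planner changes.
z0, z_goal = np.zeros(2), np.array([2.0, -1.5])
results = {p.__name__: evaluate(p, toy_dynamics, z0, z_goal)
           for p in (random_shooting, cem)}
```

Because `evaluate` depends only on the assumed `Planner` and `Dynamics` signatures, any difference between the entries of `results` is attributable to the planner alone. For brevity the plan is executed open-loop; a receding-horizon MPC loop would execute only the first action and then replan.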

Key results & what's novel

The contribution is the platform itself: a unified, reproducible stack that decouples representation learning, dynamics modeling, and planning so they can be studied independently and combined freely. By standardising data format, baselines, solvers, and evaluation, it enables apples-to-apples comparison and controlled ablation — the experimental substrate that underpins reliable claims about stable end-to-end JEPA world models. It plays the role for world-model research that benchmark suites and shared codebases played for supervised learning, guarding against irreproducible results.

Strengths & limitations

+ Reproducible, controlled comparisons via shared data, baselines, solvers, and environments.
+ Cleanly decouples encoder, dynamics, and planner so each can be swapped.
+ Lowers the engineering barrier to rigorous world-model research.
− Conclusions are bounded by the environments and baselines the stack ships with.
− A common interface can constrain methods that do not fit the state-action abstraction.
− As infrastructure it offers no new modeling idea on its own.

Connections & references

Builds on: LeWorldModel, V-JEPA 2
