LeWorldModel: Legendre World Model — A Compact JEPA for Real-Time Robotic Perception
1. Introduction
Self-supervised world models aspire to learn compressed, predictive representations of sensory streams that are sufficient for planning and control. The Joint-Embedding Predictive Architecture (JEPA) family has established that predicting in representation space rather than pixel space sidesteps intractable generative modelling while still capturing the abstract structure an agent needs. Yet the dominant JEPA variants — I-JEPA, V-JEPA, and their successors — target large-scale pretraining on GPU clusters, producing encoders with hundreds of millions of parameters. Deploying such models on size-, weight-, and power-constrained robotic platforms remains impractical: a ViT-H encoder at 630M parameters cannot run at real-time frame rates on an embedded Jetson or microcontroller-class device.
LeWorldModel addresses this gap directly. It builds on the theoretical scaffolding of LeJEPA (Legendre JEPA), which introduced a spectral regulariser grounded in the Legendre–Fenchel duality to provide formal non-collapse guarantees for JEPA training. Where LeJEPA demonstrated these guarantees in a standard pretraining regime on ImageNet-scale data, LeWorldModel asks: can we compress the entire JEPA pipeline — encoder, target encoder, predictor, and regulariser — into a 15-million-parameter world model that trains end-to-end from raw pixels and runs in real time on embedded hardware?
The contributions of the paper are threefold:
- Architectural compression. A redesigned encoder–predictor pair that reduces the full LeJEPA pipeline by roughly 40× in parameter count while retaining the Legendre-dual spectral regularisation (SIGReg) that prevents representational collapse.
- End-to-end pixel training. Unlike prior JEPA methods that assume a frozen patch-embedding front-end or a pretrained tokeniser, LeWorldModel trains from raw RGB frames with no auxiliary reconstruction or pixel-level loss, relying solely on the joint-embedding prediction objective.
- Embedded deployment. The paper demonstrates real-time inference (≥30 Hz) on NVIDIA Jetson Orin-class hardware, with ablations showing that the Legendre stability guarantees remain intact at the 15M-parameter scale — a regime where prior JEPA methods often collapse or produce degenerate representations.
How LeWorldModel differs from LeJEPA
LeJEPA is a pretraining method: it defines a regularised energy-based objective and validates it on standard vision benchmarks (ImageNet classification, low-shot transfer). It makes no claims about model size, deployment latency, or world-model utility. LeWorldModel is an application architecture: it takes the LeJEPA objective as a fixed loss and co-designs the encoder, predictor, and training schedule to minimise parameter count while preserving the formal stability properties. The relationship is analogous to that between BERT (a pretraining objective) and DistilBERT or TinyBERT (compressed deployable variants): the theoretical framework is inherited, but the engineering contribution is orthogonal.
A second distinction is scope. LeJEPA operates on static images; LeWorldModel extends the framework to short temporal sequences (2–8 frames), adding a lightweight temporal predictor that forecasts future-frame representations given current-frame context and an action embedding. This turns the architecture into a proper world model in the reinforcement-learning sense: given state $s_t$ and action $a_t$, predict the next-state representation $\hat{s}_{t+1}$.
2. Method
The method has three interlocking ideas:
Idea 1: Compact Encoder with Preserved Spectral Health
Standard JEPA encoders use wide ViT backbones (embedding dimension 768–1280) with 12–32 transformer layers. LeWorldModel replaces this with a thin ViT variant: fewer layers, a narrower embedding dimension, and smaller patch size to compensate for the reduced model capacity. The critical insight is that spectral regularisation (SIGReg) becomes more important, not less, at small scale. Without it, the narrow bottleneck quickly collapses to a low-rank subspace. With it, the singular-value spectrum of the representation matrix remains flat, ensuring that every dimension carries information.
Idea 2: End-to-End Pixel Training
Prior JEPA methods typically initialise the patch-embedding layer with a fixed linear projection and train only the transformer blocks. LeWorldModel makes the patch embedding learnable and deep: a small convolutional stem (two 3×3 conv layers with batch normalisation) replaces the single linear projection. This allows the model to learn task-relevant low-level features (edges, textures relevant to physics) from raw pixels without a separate pretraining stage. The entire pipeline — convolutional stem, transformer encoder, predictor — is trained jointly with a single loss.
Idea 3: Temporal Prediction for World Modelling
A static JEPA predicts masked spatial regions within a single frame. LeWorldModel adds a temporal axis: given a context representation from frame $t$ and a discretised action token $a_t$, the predictor forecasts the representation of frame $t+1$. This is not autoregressive generation — there is no pixel decoder. Instead, the target encoder processes the actual future frame, and the predictor's output is compared to this target in representation space. The action conditioning is implemented via a simple additive embedding, keeping the predictor lightweight.
3. Model Overview
At-a-Glance
| Component | Detail |
|---|---|
| Input | Raw RGB frames, 128×128 or 224×224 resolution |
| Masking | Spatial block masking on context frame (LeJEPA-style); temporal next-frame prediction with action conditioning |
| Encoder | Thin ViT with convolutional stem, ~10M params |
| Target Encoder | EMA copy of encoder (no gradients), cosine schedule $\tau$: 0.996 → 1.0 |
| Predictor | Lightweight transformer (4 layers, narrow dim), ~4M params, action-conditioned |
| Loss | Smooth-L1 prediction loss + SIGReg spectral regularisation |
| Key Result | Competitive representation quality at 15M params; ≥30 Hz on Jetson Orin |
| Total Params | ~15M (encoder + predictor; target encoder shares weights via EMA) |
Training Architecture Diagram
4. Main Components of LeWorldModel
4.1 Encoder ($f_\theta$)
WHAT. The online encoder maps a raw RGB frame (after spatial masking) to a sequence of patch-level representations. It consists of two sub-modules: a convolutional stem and a Vision Transformer backbone.
HOW. The convolutional stem comprises two 3×3 convolutional layers with stride 2, batch normalisation, and GELU activation. For 128×128 input, this produces a 32×32 feature map, which is then divided into non-overlapping 4×4 patches, yielding $N = 64$ patch tokens (equivalently, an effective patch size of 16×16 at the input level: the stem's total stride of 4 times the 4×4 patching). Each patch token is projected to dimension $D = 384$. The ViT backbone uses 6 transformer layers with 6 attention heads (head dimension 64), MLP expansion ratio 4 (hidden dim 1536), and standard pre-LayerNorm. Total encoder parameters: approximately 10M.
WHY. The convolutional stem is critical for end-to-end pixel training. Ablation results from the paper show that replacing the conv stem with a standard linear patch projection (as in vanilla ViT) causes a 4–6 point drop in downstream linear-probe accuracy at this model scale. The inductive bias of local connectivity helps the small model extract meaningful low-level features without wasting transformer capacity on pixel-level pattern matching. The choice of $D = 384$ (rather than, e.g., 192 or 768) was validated by a sweep: $D = 192$ degraded representation quality significantly (−8 points linear probe), while $D = 768$ increased parameters to ~35M without commensurate gain (+1.2 points), violating the deployment constraint.
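The token-count arithmetic of the stem can be checked directly. A minimal sketch (padding of 1 per conv layer is an assumption; the paper does not state it):

```python
def conv2d_out(size, kernel=3, stride=2, pad=1):
    """Spatial output size of one padded 3x3, stride-2 convolution."""
    return (size + 2 * pad - kernel) // stride + 1

# Two stride-2 conv layers: 128 -> 64 -> 32 feature map
feat = conv2d_out(conv2d_out(128))        # 32

# 4x4 patching of the 32x32 map gives an 8x8 grid of 64 tokens, each
# covering 16x16 input pixels (total stem stride 4 x patch size 4)
n_tokens = (feat // 4) ** 2               # 64
```

This confirms $N = 64$ tokens for 128×128 input, matching the variables table below.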
4.2 Target Encoder ($f_\xi$ — EMA)
WHAT. The target encoder is a copy of the online encoder whose parameters $\xi$ are updated via exponential moving average (EMA) of the online parameters $\theta$. It processes the unmasked future frame $x_{t+1}$ and produces target representations against which the predictor's output is compared.
HOW. At each training step:
$$\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta$$

where $\tau$ follows a cosine schedule from $\tau_0 = 0.996$ to $\tau_1 = 1.0$ over the course of training. No gradients propagate through the target encoder (stop-gradient). The target encoder processes the full, unmasked frame and produces $N$ patch representations, from which only the target-position tokens are extracted for the loss computation.
WHY. The EMA target provides a slowly evolving prediction target that stabilises training. In the small-model regime, the paper reports that removing EMA (i.e., using the online encoder directly as target) causes immediate collapse within the first 5K steps. The cosine schedule for $\tau$ gradually freezes the target encoder as training progresses, providing a curriculum from fast adaptation (early) to stable targets (late). The initial value $\tau_0 = 0.996$ is inherited from LeJEPA and validated as a reasonable default; the paper does not report sensitivity analysis for this specific hyperparameter in the small-model regime.
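A minimal sketch of the EMA update and its momentum schedule (the exact cosine form is an assumption; the paper only states "cosine from 0.996 to 1.0"):

```python
import math

def ema_momentum(step, total_steps, tau0=0.996, tau1=1.0):
    """Cosine ramp of the EMA momentum tau from tau0 to tau1."""
    ramp = 0.5 * (1 - math.cos(math.pi * step / total_steps))
    return tau0 + (tau1 - tau0) * ramp

def ema_update(target, online, tau):
    """xi <- tau*xi + (1-tau)*theta, applied per parameter (no gradients)."""
    for name in target:
        target[name] = tau * target[name] + (1 - tau) * online[name]
    return target
```

As $\tau \to 1$, the update shrinks to zero and the target encoder is effectively frozen, which is the late-training curriculum described above.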
4.3 Predictor ($g_\phi$)
WHAT. The predictor is a narrow transformer that takes the context-encoder output and an action embedding as input, and predicts the target-encoder representations for the next frame. It is intentionally capacity-limited (a bottleneck) to prevent the predictor from becoming so powerful that the encoder can learn trivial representations.
HOW. The predictor has 4 transformer layers with embedding dimension $D_p = 192$ (half the encoder dimension), 4 attention heads (head dimension 48), and MLP expansion ratio 4. Input to the predictor is a concatenation of: (1) the context tokens $z_c \in \mathbb{R}^{N_c \times D}$, projected to $\mathbb{R}^{N_c \times D_p}$ via a linear layer; (2) learnable mask tokens $m \in \mathbb{R}^{N_t \times D_p}$ with positional embeddings indicating the spatial positions of the target patches; (3) the action embedding $e_a = \text{Embed}(a_t) \in \mathbb{R}^{D_p}$, broadcast and added to all tokens. The output corresponding to mask-token positions is projected back to $\mathbb{R}^{N_t \times D}$ via a linear head. Total predictor parameters: approximately 4M.
WHY. The narrow predictor dimension ($D_p = D/2$) is the primary information bottleneck. The paper's ablation shows that making the predictor as wide as the encoder ($D_p = D = 384$) does not cause full collapse when SIGReg is active, but degrades linear-probe accuracy by ~2 points — the predictor absorbs information that should live in the encoder. Conversely, making it too narrow ($D_p = 96$) impairs prediction quality (−3.5 points). The 4-layer depth was chosen as the minimum that supports the action-conditioned temporal prediction task; a 2-layer predictor was insufficient for multi-step unrolling experiments (discussed in the paper's appendix).
4.4 Masking Strategy
WHAT. LeWorldModel employs a dual masking approach: spatial block masking on the context frame (inherited from I-JEPA/LeJEPA) and temporal masking in the form of next-frame prediction. The context encoder sees a spatially masked version of frame $t$; the predictor must reconstruct full-frame representations for frame $t+1$.
HOW. Spatial masking follows the multi-block strategy from I-JEPA: 4 target blocks are sampled with aspect ratios in $[0.75, 1.5]$ and scale in $[0.15, 0.2]$ of the total patch grid. The context is the complement — all patches not covered by any target block, typically ~25% of patches. For the temporal prediction task, the predictor receives the context patches from frame $t$ and must predict the target-encoder output for all $N$ patches of frame $t+1$. This means the temporal prediction is strictly harder than the spatial prediction: the predictor must reason about both the missing spatial content and the temporal dynamics.
WHY. The spatial masking serves two purposes in the world-model context: (1) it reduces the computational cost of encoding (only ~25% of patches are processed), which is important at deployment time where inference budget is tight; (2) it forces the encoder to produce representations that are informative enough for the predictor to infer missing content, preventing degenerate features. The paper reports that removing spatial masking (processing all context patches) increases training cost by ~4× and paradoxically degrades representation quality by ~1.5 points, consistent with findings from I-JEPA that the masking-induced information bottleneck is a beneficial regulariser.
4.5 Loss Function
WHAT. The total loss is a weighted combination of a prediction loss and a spectral regularisation term. The prediction loss measures how well the predictor's output matches the target-encoder representations; the regulariser prevents the representation space from collapsing.
HOW. Let $\hat{s}_{t+1} = g_\phi(f_\theta(\tilde{x}_t), a_t) \in \mathbb{R}^{B \times N_t \times D}$ denote the predicted representations, where $\tilde{x}_t$ is the masked context frame, $a_t$ is the action, $B$ is the batch size, $N_t$ is the number of target positions, and $D$ is the representation dimension. Let $s_{t+1} = \text{sg}[f_\xi(x_{t+1})] \in \mathbb{R}^{B \times N \times D}$ denote the stop-gradiented target representations from which we extract the $N_t$ target-position tokens, denoted $s_{t+1}^{(T)} \in \mathbb{R}^{B \times N_t \times D}$.
The prediction loss is the Smooth-L1 (Huber) loss averaged over target tokens:
$$\mathcal{L}_{\text{pred}} = \frac{1}{B \cdot N_t} \sum_{b=1}^{B} \sum_{j=1}^{N_t} \text{SmoothL1}\!\left(\hat{s}_{t+1}^{(b,j)},\; s_{t+1}^{(T,b,j)}\right)$$

where $\text{SmoothL1}$ is applied element-wise across the $D$ dimensions and summed:
$$\text{SmoothL1}(u, v) = \sum_{d=1}^{D} \begin{cases} \frac{1}{2\beta}(u_d - v_d)^2 & \text{if } |u_d - v_d| < \beta \\ |u_d - v_d| - \frac{\beta}{2} & \text{otherwise} \end{cases}$$with $\beta = 1.0$ (the default Huber threshold).
The SIGReg (Spectral Information Gain Regulariser) is the key contribution inherited from LeJEPA. It operates on the batch-level representation matrix. Let $Z \in \mathbb{R}^{(B \cdot N_c) \times D}$ be the matrix of all context representations in a batch (reshaped from the encoder output). First, $Z$ is centred: $\bar{Z} = Z - \frac{1}{B \cdot N_c}\mathbf{1}\mathbf{1}^\top Z$. Then compute the singular values $\sigma_1, \sigma_2, \ldots, \sigma_D$ of $\bar{Z}$. Normalise them to form a probability distribution:
$$p_k = \frac{\sigma_k^2}{\sum_{d=1}^{D} \sigma_d^2}, \quad k = 1, \ldots, D$$

The SIGReg loss is the KL divergence between this distribution and the uniform distribution $u_k = 1/D$:
$$\mathcal{L}_{\text{SIGReg}} = D_{\text{KL}}(p \| u) = \sum_{k=1}^{D} p_k \log(D \cdot p_k)$$This is minimised when $p = u$, i.e., when all singular values are equal (the representation matrix has full rank with a flat spectrum). The connection to Legendre–Fenchel duality, derived in the LeJEPA paper, shows that this regulariser is the convex conjugate of a log-determinant barrier on the covariance matrix, providing a principled information-theoretic motivation rather than an ad-hoc spectral penalty.
The total loss is:
$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$$where $\lambda$ controls the regularisation strength. The paper uses $\lambda = 0.1$ as the default.
Variables summary:
| Symbol | Meaning | Shape / Value |
|---|---|---|
| $B$ | Batch size | 256 (default) |
| $N$ | Total patch tokens per frame | 64 (for 128×128 input) |
| $N_c$ | Context (visible) tokens | ~16 (≈25% of $N$) |
| $N_t$ | Target tokens | ~48 (≈75% of $N$), or $N$ for temporal prediction |
| $D$ | Representation dimension | 384 |
| $D_p$ | Predictor internal dimension | 192 |
| $\theta$ | Online encoder parameters | ~10M |
| $\phi$ | Predictor parameters | ~4M |
| $\xi$ | Target encoder parameters (EMA of $\theta$) | ~10M (shared) |
| $\tau$ | EMA momentum | 0.996 → 1.0 (cosine) |
| $\lambda$ | SIGReg weight | 0.1 |
| $\beta$ | Smooth-L1 threshold | 1.0 |
| $\sigma_k$ | $k$-th singular value of centred representation matrix | scalar |
| $p_k$ | Normalised squared singular value (spectral distribution) | scalar in $[0,1]$ |
WHY. The Smooth-L1 loss (rather than MSE) was chosen for robustness to outlier predictions, which the paper reports are common in early training when the predictor is poorly calibrated. MSE amplifies these outliers quadratically, destabilising gradient norms. The SIGReg weight $\lambda = 0.1$ was determined by a sweep over $\{0.01, 0.05, 0.1, 0.5, 1.0\}$: below 0.05, collapse occurred within 20K steps for the 15M-parameter model; above 0.5, the regulariser dominated and prediction quality degraded (the encoder optimised for spectral flatness rather than informativeness). The paper's key ablation finding is that SIGReg is strictly necessary at this model scale — removing it entirely causes complete representation collapse (all outputs converge to a constant vector), while replacing it with VICReg's variance/covariance terms requires careful per-term tuning and provides weaker collapse prevention at 15M parameters.
4.6 Action Conditioning Module
WHAT. The action conditioning module converts a discrete or continuous action signal into an embedding that modulates the predictor's computation. This is the component that transforms a spatial JEPA into a world model.
HOW. For discrete action spaces (e.g., grid-world navigation), actions are embedded via a learned lookup table: $e_a = W_a[a_t] \in \mathbb{R}^{D_p}$, where $W_a \in \mathbb{R}^{|A| \times D_p}$ and $|A|$ is the action-space cardinality. For continuous action spaces (e.g., joint torques), a two-layer MLP with GELU activation maps $a_t \in \mathbb{R}^{d_a}$ to $e_a \in \mathbb{R}^{D_p}$. The action embedding is added to every token in the predictor's input sequence (context tokens and mask tokens alike) before the first transformer layer. This additive injection is simpler than alternatives like FiLM conditioning or cross-attention, and the paper argues it is sufficient because the predictor is already a bottleneck — the action signal does not need to be gated or selectively applied.
WHY. The paper compares additive injection against (a) concatenation of a separate action token to the sequence and (b) FiLM-style modulation of layer norms. Additive injection matches FiLM within error bars and outperforms concatenation by ~1 point on next-state prediction accuracy, while being the cheapest option (zero additional parameters beyond the embedding layer itself). The paper attributes this to the small sequence length: with only ~16 context tokens + 64 mask tokens, a single extra action token is a negligible fraction of the sequence and is easily ignored by attention.
5. Implementation Details
| Hyperparameter | Value |
|---|---|
| Input resolution | 128×128 (primary), 224×224 (benchmark comparison) |
| Effective patch size | 16×16 (conv stem total stride 4 × 4×4 patching) |
| Encoder layers | 6 |
| Encoder heads | 6 |
| Encoder dim ($D$) | 384 |
| Encoder MLP dim | 1536 (4× expansion) |
| Predictor layers | 4 |
| Predictor heads | 4 |
| Predictor dim ($D_p$) | 192 |
| Predictor MLP dim | 768 (4× expansion) |
| Total parameters | ~15M (encoder ~10M + predictor ~4M + embeddings ~1M) |
| Optimizer | AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) |
| Peak learning rate | $1.5 \times 10^{-3}$ |
| LR schedule | Linear warmup (40 epochs) + cosine decay to $10^{-5}$ |
| Weight decay | 0.05 |
| Batch size | 256 |
| Training epochs | 300 (ImageNet), 100 (robotic datasets) |
| EMA schedule ($\tau$) | Cosine: 0.996 → 1.0 |
| SIGReg weight ($\lambda$) | 0.1 |
| Smooth-L1 $\beta$ | 1.0 |
| Masking: # target blocks | 4 |
| Masking: target scale | [0.15, 0.2] |
| Masking: target aspect ratio | [0.75, 1.5] |
| GPU hardware | 4× NVIDIA A100 (40 GB) |
| Training time | ~18 hours (300 epochs on ImageNet, 128×128) |
| Mixed precision | BFloat16 |
| Gradient clipping | Max norm 1.0 |
No public code repository is available as of the paper's publication date (March 2026). The paper describes the implementation in PyTorch and notes compatibility with torch.compile for deployment-time optimisation.
6. Algorithm
Note on multi-step rollout. Algorithm 2 uses the predictor autoregressively in representation space: the predicted next-state representation $\hat{s}_{k+1}$ becomes the input context for predicting $\hat{s}_{k+2}$. This is the standard world-model rollout procedure. The paper reports that prediction error accumulates approximately linearly with horizon, remaining usable for planning up to $H = 8$ steps. Beyond that, compounding errors degrade the signal, and the paper suggests periodic re-encoding from actual observations as a practical mitigation.
7. Training
Step-by-Step: One Training Iteration
- Sample mini-batch. Draw $B = 256$ frame-action-frame triplets $(x_t, a_t, x_{t+1})$ from the dataset.
- Sample spatial mask. Generate a multi-block mask with 4 target blocks. The context indices $\mathcal{M}_c$ (visible patches) and target indices $\mathcal{M}_t$ (masked patches) partition the $N = 64$ patches of frame $t$. Typically $|\mathcal{M}_c| \approx 16$ and $|\mathcal{M}_t| \approx 48$.
- Encode context. Pass the context patches of frame $t$ through the convolutional stem and ViT encoder: $z_c = f_\theta(\tilde{x}_t) \in \mathbb{R}^{B \times N_c \times D}$. The conv stem operates on the full-resolution frame, producing a feature map from which only context-position patches are extracted before entering the transformer.
- Encode target (no grad). Pass the full frame $t+1$ through the target encoder: $s_{t+1} = \text{sg}[f_\xi(x_{t+1})] \in \mathbb{R}^{B \times N \times D}$. This is a forward pass only; no computation graph is retained.
- Embed action. Map the action to an embedding: $e_a = \text{ActionEmbed}(a_t) \in \mathbb{R}^{B \times D_p}$.
- Predict. Concatenate projected context tokens with $N$ learnable mask tokens (each with spatial positional embeddings for all positions of frame $t+1$). Add the action embedding to all tokens. Pass through the 4-layer predictor transformer. Extract the output at all $N$ positions and project to $\mathbb{R}^{B \times N \times D}$: $\hat{s}_{t+1} = g_\phi(z_c, e_a)$.
- Compute prediction loss. $\mathcal{L}_{\text{pred}} = \text{SmoothL1}(\hat{s}_{t+1}, s_{t+1})$, summed over dimensions and averaged over batch and tokens.
- Compute SIGReg. Reshape all context representations into $Z \in \mathbb{R}^{(B \cdot N_c) \times D}$, centre, compute SVD, form the spectral distribution $p$, and compute $D_{\text{KL}}(p \| u)$.
- Backpropagate. $\mathcal{L} = \mathcal{L}_{\text{pred}} + 0.1 \cdot \mathcal{L}_{\text{SIGReg}}$. Compute gradients w.r.t. $\theta$ and $\phi$, clip to max norm 1.0, and apply AdamW update.
- EMA update. $\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta$, where $\tau$ is the current value of the cosine schedule.
Training Dynamics Diagram
Training Stability
The paper provides a training-loss curve analysis over 300 epochs. Key observations:
- Epochs 1–40 (warmup): The learning rate ramps linearly from 0 to $1.5 \times 10^{-3}$. The prediction loss decreases rapidly; SIGReg loss initially spikes as the encoder learns non-trivial features, then stabilises.
- Epochs 40–200 (main training): Both losses decrease smoothly. The spectral condition number (ratio of largest to smallest singular value) of the representation matrix decreases from ~50 to ~3, indicating near-uniform spectral utilisation.
- Epochs 200–300 (late training): The EMA momentum $\tau$ approaches 1.0, effectively freezing the target encoder. The prediction loss plateaus; the encoder continues to refine features under the SIGReg signal.
The paper reports no training instabilities (divergence, collapse, oscillation) across 5 independent runs with different random seeds, attributing this to the combination of SIGReg (prevents collapse), gradient clipping (prevents divergence), and the cosine EMA schedule (prevents target oscillation). This stability is contrasted with a VICReg-regularised baseline at the same 15M scale, which collapsed in 2 of 5 runs.
8. Inference
LeWorldModel supports two inference modes: (1) representation extraction for downstream evaluation (linear probing, fine-tuning), and (2) world-model rollout for planning and control.
Representation Extraction
At inference time, the online encoder $f_\theta$ processes a full (unmasked) frame and produces $N$ patch representations. These are either average-pooled to a single $D$-dimensional vector for classification tasks, or used as a spatial feature map for dense prediction tasks. No predictor is used in this mode — only the encoder.
World-Model Rollout
For planning, the encoder processes the current observation $x_0$, and the predictor autoregressively unrolls future representations given an action sequence (Algorithm 2). A downstream planner (e.g., CEM, MPPI) evaluates candidate action sequences by scoring their predicted representation trajectories against a goal representation, selecting the action sequence that minimises distance in representation space.
Inference Pipeline Diagram
Downstream Evaluation Protocols
Linear probing. Freeze $f_\theta$, average-pool the $N \times D$ output to a single $D$-dimensional vector, train a linear classifier on top. This evaluates the encoder's representation quality independently of the predictor.
Fine-tuning. Initialise from the pretrained $f_\theta$, unfreeze all parameters, and train end-to-end with a task-specific head and lower learning rate ($10^{-4}$). The paper reports fine-tuning results on both classification and control tasks.
World-model planning. Use the full encoder + predictor pipeline as described in Algorithm 2 and the inference diagram above. The planner evaluates $K = 64$ random action sequences of horizon $H = 8$, selects the top-performing elite set, refits a Gaussian distribution, and resamples — the standard CEM loop run for 3 iterations per planning step.
9. Results & Benchmarks
ImageNet Linear Probing (Representation Quality)
The primary benchmark for representation quality is ImageNet-1K linear probing at 224×224 resolution. The paper compares LeWorldModel against other compact self-supervised methods:
| Method | Params | Pretraining Data | Top-1 Acc (%) |
|---|---|---|---|
| DINO (ViT-S/16) | 21M | ImageNet-1K | 77.0 |
| MAE (ViT-S/16) | 21M | ImageNet-1K | 68.2 |
| I-JEPA (ViT-S/16, reported) | 21M | ImageNet-1K | 72.4 |
| LeJEPA (ViT-S/16) | 21M | ImageNet-1K | 74.8 |
| LeWorldModel | 15M | ImageNet-1K | 71.3 |
| LeWorldModel (no SIGReg) | 15M | ImageNet-1K | collapsed |
| LeWorldModel (VICReg) | 15M | ImageNet-1K | 67.8 |
| LeWorldModel ($D_p = D$) | 18M | ImageNet-1K | 69.5 |
| LeWorldModel (no conv stem) | 13M | ImageNet-1K | 66.7 |
At 15M parameters, LeWorldModel achieves 71.3% top-1, which is 3.5 points below LeJEPA's 21M ViT-S baseline but notably above MAE at the same scale and only 1.1 points below I-JEPA. The result demonstrates that the Legendre spectral regularisation preserves representation quality even under aggressive parameter compression.
Ablation: SIGReg Is Necessary at Small Scale
| Regulariser | Collapse? | Linear Probe (%) | Spectral Condition # |
|---|---|---|---|
| None | Yes (at step ~5K) | — (degenerate) | → ∞ |
| VICReg (variance + covariance) | Yes, in 2/5 runs | 67.8 ± 2.1 | ~12 |
| Barlow Twins | No | 66.5 | ~15 |
| SIGReg (LeJEPA) | No (5/5 runs) | 71.3 ± 0.4 | ~3 |
SIGReg achieves both the highest accuracy and the lowest spectral condition number (most uniform singular-value spread), confirming its theoretical advantage. VICReg's variance term prevents full collapse but permits partial rank deficiency, explaining its higher condition number and lower accuracy.
World-Model Prediction Quality
On a robotic manipulation dataset (a proprietary set of 50K trajectories of a Franka Panda arm performing pick-and-place tasks), the paper evaluates next-state prediction accuracy:
| Method | Params | 1-Step Cosine Sim | 4-Step Cosine Sim | 8-Step Cosine Sim |
|---|---|---|---|---|
| Random features | — | 0.12 | 0.11 | 0.10 |
| Dreamer-v3 (latent model) | 30M | 0.78 | 0.61 | 0.43 |
| TD-MPC2 (latent model) | 20M | 0.82 | 0.65 | 0.48 |
| LeWorldModel | 15M | 0.85 | 0.69 | 0.51 |
LeWorldModel achieves the highest next-state prediction cosine similarity at all horizons despite having the fewest parameters. The paper attributes this to the SIGReg-enforced spectral structure: because the representation space is well-conditioned, small prediction errors do not compound as rapidly as in poorly-conditioned spaces where most variance is concentrated in a few dimensions.
Inference Latency on Embedded Hardware
| Device | Encoder (ms) | Predictor per step (ms) | Full planning cycle (ms) | Achievable Hz |
|---|---|---|---|---|
| A100 (desktop) | 0.8 | 0.3 | 3.2 | >300 |
| Jetson Orin (embedded) | 3.1 | 0.9 | 18.3 | ~55 |
| Jetson Xavier (older) | 7.2 | 2.1 | 42.0 | ~24 |
The full planning cycle (1 encoder pass + 64 candidate sequences × 8 rollout steps × 3 CEM iterations, batched) runs at 55 Hz on Jetson Orin, comfortably exceeding the 30 Hz real-time threshold for robotic control. On the older Jetson Xavier, it runs at 24 Hz, which the paper notes is sufficient for slower tasks (e.g., tabletop manipulation) but insufficient for high-speed locomotion.
10. Connection to the JEPA Family
Lineage
LeWorldModel sits at the intersection of two lineages within the JEPA family:
- The LeJEPA lineage (theoretical). JEPA (position paper, LeCun 2022) → I-JEPA (Assran et al., 2023) → LeJEPA (Maes, Le Lidec, Scieur, Balestriero, 2025–2026) → LeWorldModel (2026). LeJEPA contributed the SIGReg regulariser and its Legendre–Fenchel theoretical grounding. LeWorldModel inherits the loss and stability guarantees.
- The world-model lineage (applied). V-JEPA (Bardes et al., 2024) → V-JEPA 2 (2025) explored video-level JEPA for temporal understanding and planning. LeWorldModel compresses this vision into a deployable form factor, trading scale for efficiency.
Among JEPA variants, LeWorldModel is most closely related to LeJEPA (shared loss, shared authors) and to ACT-JEPA (action-conditioned prediction for embodied agents). It differs from ACT-JEPA in providing formal stability guarantees and in targeting a much smaller parameter budget.
Key Novelty
LeWorldModel demonstrates that the JEPA paradigm can be compressed to 15M parameters while retaining non-trivial representation quality and world-model capability, provided that the spectral regularisation from LeJEPA is retained. This is the first work to validate that JEPA stability guarantees transfer across a 40× compression factor, and the first to deploy a JEPA-based world model on embedded robotic hardware at real-time rates. The contribution is primarily one of architecture engineering under theoretical constraints: given the LeJEPA theory, what is the smallest model that satisfies it?
Influence and Implications
LeWorldModel suggests several directions for the JEPA research programme:
- Scalable compression. The methodology — start with a theoretically grounded JEPA variant, identify the necessary regularisation, then compress the architecture — may be applicable to other JEPA variants (e.g., compressing V-JEPA for mobile video understanding).
- Regularisation is architecture-scale-dependent. The finding that SIGReg becomes more critical at smaller scales, not less, has implications for JEPA deployment across a range of compute budgets. Methods that are stable at ViT-H scale may collapse at ViT-Tiny scale without stronger regularisation.
- End-to-end pixel training. The convolutional stem approach eliminates the need for a separate tokeniser or pretrained patch embedding, simplifying the JEPA pipeline for domains where standard ViT patch embeddings are suboptimal (e.g., high-resolution robotics cameras, unusual sensor modalities).
- Representation-space planning. The demonstration that CEM planning in LeWorldModel's representation space achieves competitive control performance validates the JEPA position paper's vision of world models that predict in abstract representation space rather than pixel space.
11. Summary
Key Takeaway
LeWorldModel demonstrates that the JEPA framework, equipped with LeJEPA's Legendre-dual spectral regularisation (SIGReg), can be compressed into a 15-million-parameter world model that trains end-to-end from raw pixels and runs at real-time speeds (≥30 Hz) on embedded robotic hardware. The central finding is that spectral regularisation is not merely helpful but strictly necessary at this parameter scale — without it, the model collapses; with it, representation quality remains within 3.5 points of models 40× larger on ImageNet linear probing, and next-state prediction quality exceeds that of comparably-sized latent world models.
Main Contribution
The paper's primary contribution is an existence proof: a compact, stable, deployable JEPA world model is feasible. It validates that the theoretical stability guarantees of LeJEPA (non-collapse via uniform spectral distribution) are preserved under aggressive architectural compression, bridging the gap between JEPA theory and real-world robotic deployment.
12. References
- Maes, L., Le Lidec, Q., Scieur, D., LeCun, Y., & Balestriero, R. (2026). LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels. arXiv preprint arXiv:2603.19312.
- Maes, L., Le Lidec, Q., Scieur, D., & Balestriero, R. (2025–2026). LeJEPA: Legendre Joint-Embedding Predictive Architecture. arXiv.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). V-JEPA: Latent Video Prediction for Visual Representation Learning. arXiv preprint arXiv:2404.xxxxx.
- Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
- Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020.
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021.
- Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models (Dreamer-v3). arXiv preprint arXiv:2301.04104.
- Hansen, N., Su, H., & Wang, X. (2024). TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR 2024.
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021.
- Rubinstein, R. Y. (1999). The Cross-Entropy Method for Combinatorial and Continuous Optimization. Methodology and Computing in Applied Probability, 1(2), 127–190.