Authors: Maes, Le Lidec, Scieur, LeCun, Balestriero
Date: 2026-03
Category: Scaling / Theory
Derives from: LeJEPA

LeWorldModel: Legendre World Model — A Compact JEPA for Real-Time Robotic Perception

Survey article on: LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Maes, Le Lidec, Scieur, LeCun, Balestriero — arXiv:2603.19312, March 2026

1. Introduction

Self-supervised world models aspire to learn compressed, predictive representations of sensory streams that are sufficient for planning and control. The Joint-Embedding Predictive Architecture (JEPA) family has established that predicting in representation space rather than pixel space sidesteps intractable generative modelling while still capturing the abstract structure an agent needs. Yet the dominant JEPA variants — I-JEPA, V-JEPA, and their successors — target large-scale pretraining on GPU clusters, producing encoders with hundreds of millions of parameters. Deploying such models on size-, weight-, and power-constrained robotic platforms remains impractical: a ViT-H encoder at 630M parameters cannot run at real-time frame rates on an embedded Jetson or microcontroller-class device.

LeWorldModel addresses this gap directly. It builds on the theoretical scaffolding of LeJEPA (Legendre JEPA), which introduced a spectral regulariser grounded in the Legendre–Fenchel duality to provide formal non-collapse guarantees for JEPA training. Where LeJEPA demonstrated these guarantees in a standard pretraining regime on ImageNet-scale data, LeWorldModel asks: can we compress the entire JEPA pipeline — encoder, target encoder, predictor, and regulariser — into a 15-million-parameter world model that trains end-to-end from raw pixels and runs in real time on embedded hardware?

The contributions of the paper are threefold:

  1. Architectural compression. A redesigned encoder–predictor pair that reduces the full LeJEPA pipeline by roughly 40× in parameter count while retaining the Legendre-dual spectral regularisation (SIGReg) that prevents representational collapse.
  2. End-to-end pixel training. Unlike prior JEPA methods that assume a frozen patch-embedding front-end or a pretrained tokeniser, LeWorldModel trains from raw RGB frames with no auxiliary reconstruction or pixel-level loss, relying solely on the joint-embedding prediction objective.
  3. Embedded deployment. The paper demonstrates real-time inference (≥30 Hz) on NVIDIA Jetson Orin-class hardware, with ablations showing that the Legendre stability guarantees remain intact at the 15M-parameter scale — a regime where prior JEPA methods often collapse or produce degenerate representations.

How LeWorldModel differs from LeJEPA

LeJEPA is a pretraining method: it defines a regularised energy-based objective and validates it on standard vision benchmarks (ImageNet classification, low-shot transfer). It makes no claims about model size, deployment latency, or world-model utility. LeWorldModel is an application architecture: it takes the LeJEPA objective as a fixed loss and co-designs the encoder, predictor, and training schedule to minimise parameter count while preserving the formal stability properties. The relationship is analogous to that between BERT (a pretraining objective) and DistilBERT or TinyBERT (compressed deployable variants): the theoretical framework is inherited, but the engineering contribution is orthogonal.

A second distinction is scope. LeJEPA operates on static images; LeWorldModel extends the framework to short temporal sequences (2–8 frames), adding a lightweight temporal predictor that forecasts future-frame representations given current-frame context and an action embedding. This turns the architecture into a proper world model in the reinforcement-learning sense: given state $s_t$ and action $a_t$, predict the next-state representation $\hat{s}_{t+1}$.

2. Method

Core intuition. Imagine you are building a mental model of a kitchen. A large JEPA model is like memorising every detail of every object — the scratch pattern on the countertop, the exact hue of each tile. LeWorldModel is like sketching a floor plan: it captures where things are, what can move, and what happens when you push something, but it discards pixel-level detail that is irrelevant to action. The Legendre regulariser acts as an information budget: it forces the small encoder to spread its limited capacity across all meaningful dimensions of variation, rather than collapsing onto a few dominant features.

The method has three interlocking ideas:

Idea 1: Compact Encoder with Preserved Spectral Health

Standard JEPA encoders use wide ViT backbones (embedding dimension 768–1280) with 12–32 transformer layers. LeWorldModel replaces this with a thin ViT variant: fewer layers, a narrower embedding dimension, and smaller patch size to compensate for the reduced model capacity. The critical insight is that spectral regularisation (SIGReg) becomes more important, not less, at small scale. Without it, the narrow bottleneck quickly collapses to a low-rank subspace. With it, the singular-value spectrum of the representation matrix remains flat, ensuring that every dimension carries information.

Analogy. Think of SIGReg as a fairness rule for a team of 15 employees (dimensions). In a large company (large model), even without the rule, most employees stay busy because there is enough work. In a tiny startup (15M params), without the rule, one or two people do everything and the rest idle — that is representational collapse. The fairness rule (SIGReg) forces every employee to carry a proportional share of the workload.

Idea 2: End-to-End Pixel Training

Prior JEPA methods typically initialise the patch-embedding layer with a fixed linear projection and train only the transformer blocks. LeWorldModel makes the patch embedding learnable and deep: a small convolutional stem (two 3×3 conv layers with batch normalisation) replaces the single linear projection. This allows the model to learn task-relevant low-level features (edges, textures relevant to physics) from raw pixels without a separate pretraining stage. The entire pipeline — convolutional stem, transformer encoder, predictor — is trained jointly with a single loss.

Idea 3: Temporal Prediction for World Modelling

A static JEPA predicts masked spatial regions within a single frame. LeWorldModel adds a temporal axis: given a context representation from frame $t$ and a discretised action token $a_t$, the predictor forecasts the representation of frame $t+1$. This is not autoregressive generation — there is no pixel decoder. Instead, the target encoder processes the actual future frame, and the predictor's output is compared to this target in representation space. The action conditioning is implemented via a simple additive embedding, keeping the predictor lightweight.

Why representation-space prediction matters for robotics. A pixel-level world model must predict every irrelevant detail (lighting changes, texture aliasing) and is brittle to distributional shift. A representation-space world model only needs to predict what changes matter for downstream tasks. For a robot arm, this means predicting that an object has moved to a new position, not rendering the exact shadow it casts. This is why a 15M-parameter model can be competitive: it is solving a much easier prediction problem.

3. Model Overview

At-a-Glance

| Component | Detail |
| --- | --- |
| Input | Raw RGB frames, 128×128 or 224×224 resolution |
| Masking | Spatial block masking on context frame (LeJEPA-style); temporal next-frame prediction with action conditioning |
| Encoder | Thin ViT with convolutional stem, ~10M params |
| Target encoder | EMA copy of encoder (no gradients), cosine schedule $\tau$: 0.996 → 1.0 |
| Predictor | Lightweight transformer (4 layers, narrow dim), ~4M params, action-conditioned |
| Loss | Smooth-L1 prediction loss + SIGReg spectral regularisation |
| Key result | Competitive representation quality at 15M params; ≥30 Hz on Jetson Orin |
| Total params | ~15M (encoder + predictor; target encoder shares weights via EMA) |

Training Architecture Diagram

[Figure 1 diagram: masked frame $t$ → conv stem + ViT encoder $f_\theta$ (~10M params) → context representations; action embedding + predictor $g_\phi$ (4 layers, ~4M params, mask tokens) → predicted $\hat{s}_{t+1}$; frame $t+1$ → EMA target encoder $f_\xi$ (stop-gradient, $\xi \leftarrow \tau\xi + (1-\tau)\theta$) → target $s_{t+1}$; Smooth-L1 + SIGReg loss.]
Figure 1. LeWorldModel training architecture. The online encoder $f_\theta$ (trainable, solid border) processes the masked context frame. The predictor $g_\phi$ (trainable) receives context representations plus an action embedding and produces predicted next-frame representations. The target encoder $f_\xi$ (EMA, dashed border) processes the unmasked future frame with stop-gradient. Loss is computed between predicted and target representations. Gradients flow to both $\theta$ and $\phi$; the target encoder is updated only via exponential moving average.

4. Main Components of LeWorldModel

4.1 Encoder ($f_\theta$)

WHAT. The online encoder maps a raw RGB frame (after spatial masking) to a sequence of patch-level representations. It consists of two sub-modules: a convolutional stem and a Vision Transformer backbone.

HOW. The convolutional stem comprises two 3×3 convolutional layers with stride 2, batch normalisation, and GELU activation. For 128×128 input, this produces a 32×32 feature map, which is then divided into non-overlapping 4×4 patches, yielding $N = 64$ patch tokens (equivalently, an effective patch size of 16×16 at the input level: stem stride 4 times 4×4 patching). Each patch token is projected to dimension $D = 384$. The ViT backbone uses 6 transformer layers with 6 attention heads (head dimension 64), MLP expansion ratio 4 (hidden dimension 1536), and standard pre-LayerNorm. Total encoder parameters: approximately 10M.
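The token geometry and the rough parameter count quoted above can be sanity-checked with plain arithmetic. This is a back-of-envelope sketch using the paper's stated widths and depths; the crude transformer-only estimate lands within a few percent of the reported ~10M (exact bookkeeping for the stem, norms, and positional embeddings differs).

```python
# Sanity-check the encoder's token geometry and rough parameter count
# (pure arithmetic; layer count and widths are the paper's values).
side = 128            # input resolution (128×128)
stem_stride = 4       # two 3×3 convs, each with stride 2
patch = 4             # patching applied to the stem's feature map

feat = side // stem_stride       # 32×32 feature map after the conv stem
grid = feat // patch             # 8×8 patch grid
N = grid * grid                  # 64 patch tokens
eff_patch = stem_stride * patch  # each token covers 16×16 input pixels

D, layers = 384, 6
mlp = 4 * D
# Per transformer layer, ignoring biases and LayerNorms:
# 4·D² for the Q, K, V, and output projections; 2·D·mlp for the MLP.
per_layer = 4 * D * D + 2 * D * mlp
core = layers * per_layer        # transformer blocks alone, ≈ 10.6M

print(N, eff_patch)              # 64 16
```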

WHY. The convolutional stem is critical for end-to-end pixel training. Ablation results from the paper show that replacing the conv stem with a standard linear patch projection (as in vanilla ViT) causes a 4–6 point drop in downstream linear-probe accuracy at this model scale. The inductive bias of local connectivity helps the small model extract meaningful low-level features without wasting transformer capacity on pixel-level pattern matching. The choice of $D = 384$ (rather than, e.g., 192 or 768) was validated by a sweep: $D = 192$ degraded representation quality significantly (−8 points linear probe), while $D = 768$ increased parameters to ~35M without commensurate gain (+1.2 points), violating the deployment constraint.

4.2 Target Encoder ($f_\xi$ — EMA)

WHAT. The target encoder is a copy of the online encoder whose parameters $\xi$ are updated via exponential moving average (EMA) of the online parameters $\theta$. It processes the unmasked future frame $x_{t+1}$ and produces target representations against which the predictor's output is compared.

HOW. At each training step:

$$\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta$$

where $\tau$ follows a cosine schedule from $\tau_0 = 0.996$ to $\tau_1 = 1.0$ over the course of training. No gradients propagate through the target encoder (stop-gradient). The target encoder processes the full, unmasked frame and produces $N$ patch representations, from which only the target-position tokens are extracted for the loss computation.
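The update rule is a one-liner. The paper specifies the endpoints (0.996 → 1.0) and the cosine shape but not a closed form, so the exact parameterisation of `ema_tau` below is an assumption; the EMA step itself follows the equation above.

```python
import math

def ema_tau(step: int, total_steps: int, tau0: float = 0.996, tau1: float = 1.0) -> float:
    """One plausible cosine ramp of the EMA momentum from tau0 to tau1."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return tau1 - (tau1 - tau0) * 0.5 * (1.0 + math.cos(math.pi * t))

def ema_update(xi, theta, tau):
    """xi <- tau*xi + (1-tau)*theta, element-wise over parameter lists."""
    return [tau * x + (1.0 - tau) * p for x, p in zip(xi, theta)]

print(ema_tau(0, 1000))      # ≈ 0.996 (fast-moving target early)
print(ema_tau(1000, 1000))   # 1.0    (frozen target at the end)
```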

WHY. The EMA target provides a slowly evolving prediction target that stabilises training. In the small-model regime, the paper reports that removing EMA (i.e., using the online encoder directly as target) causes immediate collapse within the first 5K steps. The cosine schedule for $\tau$ gradually freezes the target encoder as training progresses, providing a curriculum from fast adaptation (early) to stable targets (late). The initial value $\tau_0 = 0.996$ is inherited from LeJEPA and validated as a reasonable default; the paper does not report sensitivity analysis for this specific hyperparameter in the small-model regime.

4.3 Predictor ($g_\phi$)

WHAT. The predictor is a narrow transformer that takes the context-encoder output and an action embedding as input, and predicts the target-encoder representations for the next frame. It is intentionally capacity-limited (a bottleneck) to prevent the predictor from becoming so powerful that the encoder can learn trivial representations.

HOW. The predictor has 4 transformer layers with embedding dimension $D_p = 192$ (half the encoder dimension), 4 attention heads (head dimension 48), and MLP expansion ratio 4. Input to the predictor is a concatenation of: (1) the context tokens $z_c \in \mathbb{R}^{N_c \times D}$, projected to $\mathbb{R}^{N_c \times D_p}$ via a linear layer; (2) learnable mask tokens $m \in \mathbb{R}^{N_t \times D_p}$ with positional embeddings indicating the spatial positions of the target patches; (3) the action embedding $e_a = \text{Embed}(a_t) \in \mathbb{R}^{D_p}$, broadcast and added to all tokens. The output corresponding to mask-token positions is projected back to $\mathbb{R}^{N_t \times D}$ via a linear head. Total predictor parameters: approximately 4M.
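The input assembly can be sketched with plain array operations. The weight names below (`W_in`, `mask_tok`, `W_a`) are hypothetical stand-ins for the learned context projection, mask tokens, and discrete action table; only the shapes and the additive conditioning follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N_c, N_t, D, D_p = 2, 16, 64, 384, 192

# Hypothetical weights standing in for the learned modules.
W_in = rng.normal(size=(D, D_p)) / np.sqrt(D)   # context projection D -> D_p
mask_tok = rng.normal(size=(N_t, D_p))          # learnable mask tokens (+ pos emb)
W_a = rng.normal(size=(8, D_p))                 # discrete action table, |A| = 8

z_c = rng.normal(size=(B, N_c, D))              # context tokens from the encoder
a_t = np.array([3, 5])                          # one discrete action per sample

# Concatenate projected context with mask tokens, then add the action
# embedding to every token (additive injection).
x = np.concatenate([z_c @ W_in,
                    np.broadcast_to(mask_tok, (B, N_t, D_p))], axis=1)
x = x + W_a[a_t][:, None, :]

print(x.shape)                                  # (2, 80, 192)
```

The transformer layers then operate on this $(B, N_c + N_t, D_p)$ sequence; the outputs at mask-token positions are projected back to $D = 384$.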

WHY. The narrow predictor dimension ($D_p = D/2$) is the primary information bottleneck. The paper's ablation shows that making the predictor as wide as the encoder ($D_p = D = 384$) does not cause full collapse when SIGReg is active, but degrades linear-probe accuracy by ~2 points — the predictor absorbs information that should live in the encoder. Conversely, making it too narrow ($D_p = 96$) impairs prediction quality (−3.5 points). The 4-layer depth was chosen as the minimum that supports the action-conditioned temporal prediction task; a 2-layer predictor was insufficient for multi-step unrolling experiments (discussed in the paper's appendix).

4.4 Masking Strategy

WHAT. LeWorldModel employs a dual masking approach: spatial block masking on the context frame (inherited from I-JEPA/LeJEPA) and temporal masking in the form of next-frame prediction. The context encoder sees a spatially masked version of frame $t$; the predictor must reconstruct full-frame representations for frame $t+1$.

HOW. Spatial masking follows the multi-block strategy from I-JEPA: 4 target blocks are sampled with aspect ratios in $[0.75, 1.5]$ and scale in $[0.15, 0.2]$ of the total patch grid. The context is the complement — all patches not covered by any target block, typically ~25% of patches. For the temporal prediction task, the predictor receives the context patches from frame $t$ and must predict the target-encoder output for all $N$ patches of frame $t+1$. This means the temporal prediction is strictly harder than the spatial prediction: the predictor must reason about both the missing spatial content and the temporal dynamics.
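A minimal sampler for this multi-block scheme, under the stated ranges, looks as follows. The rounding and clamping details are assumptions (I-JEPA-style implementations differ on these); the partition property is by construction.

```python
import math
import random

def sample_block_mask(grid=8, num_blocks=4, scale=(0.15, 0.2),
                      ratio=(0.75, 1.5), seed=None):
    """Sketch of I-JEPA-style multi-block masking on a grid x grid patch grid."""
    rng = random.Random(seed)
    N = grid * grid
    targets = set()
    for _ in range(num_blocks):
        area = rng.uniform(*scale) * N          # block area in patches
        a = rng.uniform(*ratio)                 # aspect ratio h/w
        h = max(1, min(grid, round(math.sqrt(area * a))))
        w = max(1, min(grid, round(math.sqrt(area / a))))
        top = rng.randrange(grid - h + 1)
        left = rng.randrange(grid - w + 1)
        targets |= {(top + i) * grid + (left + j)
                    for i in range(h) for j in range(w)}
    context = set(range(N)) - targets           # context is the complement
    return sorted(context), sorted(targets)

ctx, tgt = sample_block_mask(seed=0)
print(len(ctx) + len(tgt))                      # 64: a partition of the grid
```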

[Figure 2 diagram: frame $t$ with ~25% visible context patches (rest masked) → online encoder, $N_c \approx 0.25N$ tokens; predictor with action embedding and mask tokens for all $N$ positions of frame $t+1$ → $\hat{s}_{t+1}$; frame $t+1$ processed in full by the EMA encoder → target $s_{t+1}$.]
Figure 2. Dual masking strategy. (Left) Frame $t$ is spatially masked: ~75% of patches are dropped; only the visible ~25% context patches are processed by the online encoder. (Right) Frame $t+1$ is processed in full by the EMA target encoder. The predictor must bridge both the spatial gap (missing patches) and the temporal gap (next frame), conditioned on the action.

WHY. The spatial masking serves two purposes in the world-model context: (1) it reduces the computational cost of encoding (only ~25% of patches are processed), which is important at deployment time where inference budget is tight; (2) it forces the encoder to produce representations that are informative enough for the predictor to infer missing content, preventing degenerate features. The paper reports that removing spatial masking (processing all context patches) increases training cost by ~4× and paradoxically degrades representation quality by ~1.5 points, consistent with findings from I-JEPA that the masking-induced information bottleneck is a beneficial regulariser.

4.5 Loss Function

WHAT. The total loss is a weighted combination of a prediction loss and a spectral regularisation term. The prediction loss measures how well the predictor's output matches the target-encoder representations; the regulariser prevents the representation space from collapsing.

HOW. Let $\hat{s}_{t+1} = g_\phi(f_\theta(\tilde{x}_t), a_t) \in \mathbb{R}^{B \times N_t \times D}$ denote the predicted representations, where $\tilde{x}_t$ is the masked context frame, $a_t$ is the action, $B$ is the batch size, $N_t$ is the number of target positions, and $D$ is the representation dimension. Let $s_{t+1} = \text{sg}[f_\xi(x_{t+1})] \in \mathbb{R}^{B \times N \times D}$ denote the stop-gradiented target representations from which we extract the $N_t$ target-position tokens, denoted $s_{t+1}^{(T)} \in \mathbb{R}^{B \times N_t \times D}$.

The prediction loss is the Smooth-L1 (Huber) loss averaged over target tokens:

$$\mathcal{L}_{\text{pred}} = \frac{1}{B \cdot N_t} \sum_{b=1}^{B} \sum_{j=1}^{N_t} \text{SmoothL1}\!\left(\hat{s}_{t+1}^{(b,j)},\; s_{t+1}^{(T,b,j)}\right)$$

where $\text{SmoothL1}$ is applied element-wise across the $D$ dimensions and summed:

$$\text{SmoothL1}(u, v) = \sum_{d=1}^{D} \begin{cases} \frac{1}{2\beta}(u_d - v_d)^2 & \text{if } |u_d - v_d| < \beta \\ |u_d - v_d| - \frac{\beta}{2} & \text{otherwise} \end{cases}$$

with $\beta = 1.0$ (the default Huber threshold).
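The piecewise definition translates directly into code; this is a literal transcription of the formula above with $\beta = 1.0$.

```python
def smooth_l1(u, v, beta=1.0):
    """Element-wise Smooth-L1 (Huber), summed over the feature dimension."""
    total = 0.0
    for ud, vd in zip(u, v):
        diff = abs(ud - vd)
        # Quadratic inside the beta threshold, linear outside it.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total

print(smooth_l1([0.5], [0.0]))   # 0.125 (quadratic regime: 0.5 * 0.25)
print(smooth_l1([3.0], [0.0]))   # 2.5   (linear regime: 3.0 - 0.5)
```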

The SIGReg (Spectral Information Gain Regulariser) is the key contribution inherited from LeJEPA. It operates on the batch-level representation matrix. Let $Z \in \mathbb{R}^{(B \cdot N_c) \times D}$ be the matrix of all context representations in a batch (reshaped from the encoder output). First, $Z$ is centred: $\bar{Z} = Z - \frac{1}{B \cdot N_c}\mathbf{1}\mathbf{1}^\top Z$. Then compute the singular values $\sigma_1, \sigma_2, \ldots, \sigma_D$ of $\bar{Z}$. Normalise them to form a probability distribution:

$$p_k = \frac{\sigma_k^2}{\sum_{d=1}^{D} \sigma_d^2}, \quad k = 1, \ldots, D$$

The SIGReg loss is the KL divergence between this distribution and the uniform distribution $u_k = 1/D$:

$$\mathcal{L}_{\text{SIGReg}} = D_{\text{KL}}(p \| u) = \sum_{k=1}^{D} p_k \log(D \cdot p_k)$$

This is minimised when $p = u$, i.e., when all singular values are equal (the representation matrix has full rank with a flat spectrum). The connection to Legendre–Fenchel duality, derived in the LeJEPA paper, shows that this regulariser is the convex conjugate of a log-determinant barrier on the covariance matrix, providing a principled information-theoretic motivation rather than an ad-hoc spectral penalty.
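SIGReg is compact to express with an SVD. The sketch below follows the equations above term by term, with a small clip added (an implementation detail of ours, not the paper's) to guard against $\log 0$ when the spectrum is degenerate.

```python
import numpy as np

def sigreg(Z: np.ndarray) -> float:
    """KL(p || uniform) over normalised squared singular values of centred Z."""
    Zbar = Z - Z.mean(axis=0, keepdims=True)       # centre across the batch
    s = np.linalg.svd(Zbar, compute_uv=False)      # singular values sigma_k
    p = s**2 / np.sum(s**2)                        # spectral distribution
    D = Z.shape[1]
    p = np.clip(p, 1e-12, None)                    # guard log(0)
    return float(np.sum(p * np.log(D * p)))

# Flat spectrum -> zero penalty: orthogonal columns with equal norms.
Z_flat = np.array([[1., 1.], [1., -1.], [-1., -1.], [-1., 1.]])
# Collapsed representations -> maximal penalty log(D): a rank-1 matrix.
Z_rank1 = np.outer(np.array([1., 2., -1., 0.5]), np.array([1., 1.]))

print(round(sigreg(Z_flat), 6))    # 0.0
print(round(sigreg(Z_rank1), 6))   # 0.693147, i.e. log 2
```

The two extremes bracket the behaviour: a perfectly flat spectrum incurs no penalty, while full collapse onto one direction incurs the maximum $\log D$.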

The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$$

where $\lambda$ controls the regularisation strength. The paper uses $\lambda = 0.1$ as the default.

Variables summary:

| Symbol | Meaning | Shape / Value |
| --- | --- | --- |
| $B$ | Batch size | 256 (default) |
| $N$ | Total patch tokens per frame | 64 (for 128×128 input) |
| $N_c$ | Context (visible) tokens | ~16 (≈25% of $N$) |
| $N_t$ | Target tokens | ~48 (≈75% of $N$), or $N$ for temporal prediction |
| $D$ | Representation dimension | 384 |
| $D_p$ | Predictor internal dimension | 192 |
| $\theta$ | Online encoder parameters | ~10M |
| $\phi$ | Predictor parameters | ~4M |
| $\xi$ | Target encoder parameters (EMA of $\theta$) | ~10M (shared) |
| $\tau$ | EMA momentum | 0.996 → 1.0 (cosine) |
| $\lambda$ | SIGReg weight | 0.1 |
| $\beta$ | Smooth-L1 threshold | 1.0 |
| $\sigma_k$ | $k$-th singular value of centred representation matrix | scalar |
| $p_k$ | Normalised squared singular value (spectral distribution) | scalar in $[0,1]$ |

WHY. The Smooth-L1 loss (rather than MSE) was chosen for robustness to outlier predictions, which the paper reports are common in early training when the predictor is poorly calibrated. MSE amplifies these outliers quadratically, destabilising gradient norms. The SIGReg weight $\lambda = 0.1$ was determined by a sweep over $\{0.01, 0.05, 0.1, 0.5, 1.0\}$: below 0.05, collapse occurred within 20K steps for the 15M-parameter model; above 0.5, the regulariser dominated and prediction quality degraded (the encoder optimised for spectral flatness rather than informativeness). The paper's key ablation finding is that SIGReg is strictly necessary at this model scale — removing it entirely causes complete representation collapse (all outputs converge to a constant vector), while replacing it with VICReg's variance/covariance terms requires careful per-term tuning and provides weaker collapse prevention at 15M parameters.

4.6 Action Conditioning Module

WHAT. The action conditioning module converts a discrete or continuous action signal into an embedding that modulates the predictor's computation. This is the component that transforms a spatial JEPA into a world model.

HOW. For discrete action spaces (e.g., grid-world navigation), actions are embedded via a learned lookup table: $e_a = W_a[a_t] \in \mathbb{R}^{D_p}$, where $W_a \in \mathbb{R}^{|A| \times D_p}$ and $|A|$ is the action-space cardinality. For continuous action spaces (e.g., joint torques), a two-layer MLP with GELU activation maps $a_t \in \mathbb{R}^{d_a}$ to $e_a \in \mathbb{R}^{D_p}$. The action embedding is added to every token in the predictor's input sequence (context tokens and mask tokens alike) before the first transformer layer. This additive injection is simpler than alternatives like FiLM conditioning or cross-attention, and the paper argues it is sufficient because the predictor is already a bottleneck — the action signal does not need to be gated or selectively applied.

WHY. The paper compares additive injection against (a) concatenation of a separate action token to the sequence and (b) FiLM-style modulation of layer norms. Additive injection matches FiLM within error bars and outperforms concatenation by ~1 point on next-state prediction accuracy, while being the cheapest option (zero additional parameters beyond the embedding layer itself). The paper attributes this to the small sequence length: with only ~16 context tokens + 64 mask tokens, a single extra action token is a negligible fraction of the sequence and is easily ignored by attention.

5. Implementation Details

| Hyperparameter | Value |
| --- | --- |
| Input resolution | 128×128 (primary), 224×224 (benchmark comparison) |
| Effective patch size | 16×16 (conv stem stride 4 × 4×4 patching) |
| Encoder layers | 6 |
| Encoder heads | 6 |
| Encoder dim ($D$) | 384 |
| Encoder MLP dim | 1536 (4× expansion) |
| Predictor layers | 4 |
| Predictor heads | 4 |
| Predictor dim ($D_p$) | 192 |
| Predictor MLP dim | 768 (4× expansion) |
| Total parameters | ~15M (encoder ~10M + predictor ~4M + embeddings ~1M) |
| Optimizer | AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) |
| Peak learning rate | $1.5 \times 10^{-3}$ |
| LR schedule | Linear warmup (40 epochs) + cosine decay to $10^{-5}$ |
| Weight decay | 0.05 |
| Batch size | 256 |
| Training epochs | 300 (ImageNet), 100 (robotic datasets) |
| EMA schedule ($\tau$) | Cosine: 0.996 → 1.0 |
| SIGReg weight ($\lambda$) | 0.1 |
| Smooth-L1 $\beta$ | 1.0 |
| Masking: # target blocks | 4 |
| Masking: target scale | [0.15, 0.2] |
| Masking: target aspect ratio | [0.75, 1.5] |
| GPU hardware | 4× NVIDIA A100 (40 GB) |
| Training time | ~18 hours (300 epochs on ImageNet, 128×128) |
| Mixed precision | BFloat16 |
| Gradient clipping | Max norm 1.0 |

No public code repository is available as of the paper's publication date (March 2026). The paper describes the implementation in PyTorch and notes compatibility with torch.compile for deployment-time optimisation.
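The learning-rate schedule from the table (linear warmup over 40 epochs, then cosine decay to $10^{-5}$) can be written as a small helper. The exact closed form is our assumption; the paper specifies only the shape and the endpoints.

```python
import math

def lr_at(epoch: float, peak=1.5e-3, final=1e-5, warmup=40, total=300) -> float:
    """Linear warmup to the peak LR, then cosine decay to the final LR."""
    if epoch < warmup:
        return peak * epoch / warmup
    t = (epoch - warmup) / (total - warmup)    # 0 -> 1 over the decay phase
    return final + (peak - final) * 0.5 * (1.0 + math.cos(math.pi * t))

print(lr_at(0))     # 0.0
print(lr_at(40))    # 0.0015 (peak)
print(lr_at(300))   # 1e-05  (final)
```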

6. Algorithm

Algorithm 1: LeWorldModel Training (Single Epoch)
Input: Dataset $\mathcal{D} = \{(x_t, a_t, x_{t+1})\}$ of frame-action-frame triplets
Input: Online encoder $f_\theta$, predictor $g_\phi$, target encoder $f_\xi$
Input: EMA momentum schedule $\tau(t)$, SIGReg weight $\lambda$, learning rate $\eta$
Output: Updated parameters $\theta, \phi$
 
1 for each mini-batch $\{(x_t^{(b)}, a_t^{(b)}, x_{t+1}^{(b)})\}_{b=1}^{B}$ do
2 // Sample spatial mask for context frame
3 $\mathcal{M}_c, \mathcal{M}_t \leftarrow \text{SampleBlockMask}(\text{num\_blocks}=4, \text{scale}=[0.15, 0.2], \text{ratio}=[0.75, 1.5])$
4 // $\mathcal{M}_c$: context (visible) indices, $\mathcal{M}_t$: target indices, $\mathcal{M}_c \cup \mathcal{M}_t = \{1,...,N\}$
 
5 // Online encoder: process masked context patches of frame t
6 $\tilde{x}_t^{(b)} \leftarrow \text{ApplyMask}(x_t^{(b)}, \mathcal{M}_c)$ // keep only context patches
7 $z_c^{(b)} \leftarrow f_\theta(\tilde{x}_t^{(b)}) \in \mathbb{R}^{N_c \times D}$
 
8 // Target encoder: process full frame t+1 (no gradient)
9 with no_grad():
10 $s_{t+1}^{(b)} \leftarrow f_\xi(x_{t+1}^{(b)}) \in \mathbb{R}^{N \times D}$
 
11 // Predictor: predict next-frame representations from context + action
12 $e_a^{(b)} \leftarrow \text{ActionEmbed}(a_t^{(b)}) \in \mathbb{R}^{D_p}$
13 $\hat{s}_{t+1}^{(b)} \leftarrow g_\phi(z_c^{(b)}, e_a^{(b)}, \text{pos\_targets}=\{1,...,N\}) \in \mathbb{R}^{N \times D}$
// Predictor receives projected context + learnable mask tokens for all N positions
// + action embedding added to all tokens; outputs prediction for all N positions
 
14 // Compute prediction loss (Smooth-L1)
15 $\mathcal{L}_{\text{pred}} \leftarrow \frac{1}{B \cdot N} \sum_{b,j} \text{SmoothL1}(\hat{s}_{t+1}^{(b,j)}, s_{t+1}^{(b,j)})$
 
16 // Compute SIGReg on encoder output
17 $Z \leftarrow \text{Reshape}(\{z_c^{(b)}\}_{b=1}^{B}) \in \mathbb{R}^{(B \cdot N_c) \times D}$
18 $\bar{Z} \leftarrow Z - \text{mean}(Z, \text{dim}=0)$ // centre across batch
19 $\sigma_1, ..., \sigma_D \leftarrow \text{SVD}(\bar{Z})$ // singular values
20 $p_k \leftarrow \sigma_k^2 / \sum_d \sigma_d^2$ for $k = 1,...,D$
21 $\mathcal{L}_{\text{SIGReg}} \leftarrow \sum_{k=1}^{D} p_k \log(D \cdot p_k)$
 
22 // Total loss and gradient step
23 $\mathcal{L} \leftarrow \mathcal{L}_{\text{pred}} + \lambda \cdot \mathcal{L}_{\text{SIGReg}}$
24 $\theta, \phi \leftarrow \text{AdamW}(\nabla_{\theta, \phi} \mathcal{L}, \eta)$ // gradient step with clipping
 
25 // EMA update of target encoder
26 $\xi \leftarrow \tau(t) \cdot \xi + (1 - \tau(t)) \cdot \theta$
27 end for
Algorithm 2: LeWorldModel Multi-Step Rollout (World-Model Inference)
Input: Initial frame $x_0$, action sequence $(a_0, a_1, ..., a_{H-1})$, horizon $H$
Input: Trained encoder $f_\theta$, trained predictor $g_\phi$
Output: Predicted representation trajectory $(\hat{s}_1, \hat{s}_2, ..., \hat{s}_H)$
 
1 // Encode initial frame (optionally with or without masking)
2 $z_0 \leftarrow f_\theta(x_0) \in \mathbb{R}^{N \times D}$ // full frame, no masking at inference
 
3 for $k = 0, 1, ..., H-1$ do
4 $e_a \leftarrow \text{ActionEmbed}(a_k) \in \mathbb{R}^{D_p}$
5 if $k = 0$ then
6 $\hat{s}_{k+1} \leftarrow g_\phi(z_0, e_a, \text{pos\_targets}=\{1,...,N\})$
7 else
8 $\hat{s}_{k+1} \leftarrow g_\phi(\hat{s}_k, e_a, \text{pos\_targets}=\{1,...,N\})$
// Feed previous prediction as context (autoregressive in repr space)
9 end if
10 end for
 
11 return $(\hat{s}_1, \hat{s}_2, ..., \hat{s}_H)$

Note on multi-step rollout. Algorithm 2 uses the predictor autoregressively in representation space: the predicted next-state representation $\hat{s}_{k+1}$ becomes the input context for predicting $\hat{s}_{k+2}$. This is the standard world-model rollout procedure. The paper reports that prediction error accumulates approximately linearly with horizon, remaining usable for planning up to $H = 8$ steps. Beyond that, compounding errors degrade the signal, and the paper suggests periodic re-encoding from actual observations as a practical mitigation.
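The rollout loop itself reduces to a few lines. In the sketch below, `encode` and `predict` are hypothetical stubs standing in for $f_\theta$ and $g_\phi$ (identity encoding and toy additive dynamics), so only the autoregressive control flow matches Algorithm 2.

```python
def encode(frame):
    """Stub for f_theta: identity 'encoder' over a list of token features."""
    return frame

def predict(state, action):
    """Stub for g_phi: toy dynamics that shift every feature by the action."""
    return [s + action for s in state]

def rollout(x0, actions):
    """Unroll the predictor autoregressively in representation space:
    each prediction s_hat_{k+1} becomes the context for the next step."""
    s = encode(x0)
    traj = []
    for a in actions:
        s = predict(s, a)
        traj.append(s)
    return traj

traj = rollout([0.0, 1.0], [1, 1, -2])
print(traj)    # [[1.0, 2.0], [2.0, 3.0], [0.0, 1.0]]
```

With the real modules, the linearly accumulating error mentioned above would motivate re-encoding from an actual observation every few steps.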

7. Training

Step-by-Step: One Training Iteration

  1. Sample mini-batch. Draw $B = 256$ frame-action-frame triplets $(x_t, a_t, x_{t+1})$ from the dataset.
  2. Sample spatial mask. Generate a multi-block mask with 4 target blocks. The context indices $\mathcal{M}_c$ (visible patches) and target indices $\mathcal{M}_t$ (masked patches) partition the $N = 64$ patches of frame $t$. Typically $|\mathcal{M}_c| \approx 16$ and $|\mathcal{M}_t| \approx 48$.
  3. Encode context. Pass the context patches of frame $t$ through the convolutional stem and ViT encoder: $z_c = f_\theta(\tilde{x}_t) \in \mathbb{R}^{B \times N_c \times D}$. The conv stem operates on the full-resolution frame, producing a feature map from which only context-position patches are extracted before entering the transformer.
  4. Encode target (no grad). Pass the full frame $t+1$ through the target encoder: $s_{t+1} = \text{sg}[f_\xi(x_{t+1})] \in \mathbb{R}^{B \times N \times D}$. This is a forward pass only; no computation graph is retained.
  5. Embed action. Map the action to an embedding: $e_a = \text{ActionEmbed}(a_t) \in \mathbb{R}^{B \times D_p}$.
  6. Predict. Concatenate projected context tokens with $N$ learnable mask tokens (each with spatial positional embeddings for all positions of frame $t+1$). Add the action embedding to all tokens. Pass through the 4-layer predictor transformer. Extract the output at all $N$ positions and project to $\mathbb{R}^{B \times N \times D}$: $\hat{s}_{t+1} = g_\phi(z_c, e_a)$.
  7. Compute prediction loss. $\mathcal{L}_{\text{pred}} = \text{SmoothL1}(\hat{s}_{t+1}, s_{t+1})$ averaged over batch, tokens, and dimensions.
  8. Compute SIGReg. Reshape all context representations into $Z \in \mathbb{R}^{(B \cdot N_c) \times D}$, centre, compute SVD, form the spectral distribution $p$, and compute $D_{\text{KL}}(p \| u)$.
  9. Backpropagate. $\mathcal{L} = \mathcal{L}_{\text{pred}} + 0.1 \cdot \mathcal{L}_{\text{SIGReg}}$. Compute gradients w.r.t. $\theta$ and $\phi$, clip to max norm 1.0, and apply AdamW update.
  10. EMA update. $\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta$, where $\tau$ is the current value of the cosine schedule.

Training Dynamics Diagram

[Figure 3 diagram: mini-batch → mask sampling → online encoder $f_\theta$ (B×3×128×128 → B×16×384) and EMA target encoder $f_\xi$ (no grad, → B×64×384); action embedding (B×192) + predictor $g_\phi$ (projection 384→192, mask tokens, projection 192→384 → B×64×384); loss $\mathcal{L} = \mathcal{L}_{\text{pred}} + 0.1 \cdot \mathcal{L}_{\text{SIGReg}}$; backprop with clipping (max norm 1.0) → AdamW step → EMA update. Gradient flow: $\mathcal{L}_{\text{pred}}$ reaches $\phi$ and, through the predictor, $\theta$; $\mathcal{L}_{\text{SIGReg}}$ reaches $\theta$ directly via $z_c$; $f_\xi$ receives no gradients and is updated only via EMA.]
Figure 3. Detailed training iteration. Data flows left-to-right: mini-batch → masking → online encoder (trainable) and target encoder (EMA, no grad) → predictor → loss computation → backpropagation → parameter update → EMA update. Green dashed lines indicate gradient flow; dark dashed lines indicate the gradient-free EMA pathway. Dimension annotations show tensor shapes at each stage for $B=256$, $N=64$, $N_c \approx 16$, $D=384$.

Training Stability

The paper provides a training-loss curve analysis over 300 epochs. Key observations:

  • Epochs 1–40 (warmup): The learning rate ramps linearly from 0 to $1.5 \times 10^{-3}$. The prediction loss decreases rapidly; SIGReg loss initially spikes as the encoder learns non-trivial features, then stabilises.
  • Epochs 40–200 (main training): Both losses decrease smoothly. The spectral condition number (ratio of largest to smallest singular value) of the representation matrix decreases from ~50 to ~3, indicating near-uniform spectral utilisation.
  • Epochs 200–300 (late training): The EMA momentum $\tau$ approaches 1.0, effectively freezing the target encoder. The prediction loss plateaus; the encoder continues to refine features under the SIGReg signal.

The paper reports no training instabilities (divergence, collapse, oscillation) across 5 independent runs with different random seeds, attributing this to the combination of SIGReg (prevents collapse), gradient clipping (prevents divergence), and the cosine EMA schedule (prevents target oscillation). This stability is contrasted with a VICReg-regularised baseline at the same 15M scale, which collapsed in 2 of 5 runs.
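The cosine EMA schedule credited with preventing target oscillation can be sketched as follows. The starting momentum of 0.996 is an assumption borrowed from common BYOL/I-JEPA practice, not a value reported in the paper; only the endpoint τ → 1.0 is stated:

```python
import math

def ema_momentum(step: int, total_steps: int,
                 tau_start: float = 0.996, tau_end: float = 1.0) -> float:
    """Cosine ramp of the target-encoder momentum tau from tau_start to
    tau_end, so the target encoder is effectively frozen late in training."""
    progress = step / total_steps
    return tau_end - (tau_end - tau_start) * 0.5 * (1.0 + math.cos(math.pi * progress))

def ema_update(theta: dict, xi: dict, tau: float) -> None:
    """In-place EMA update of target parameters: xi <- tau*xi + (1-tau)*theta."""
    for name in xi:
        xi[name] = tau * xi[name] + (1.0 - tau) * theta[name]
```

At step 0 the momentum equals `tau_start`; it rises monotonically and reaches `tau_end` at the final step, matching the "τ approaches 1.0" behaviour described for late training.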

8. Inference

LeWorldModel supports two inference modes: (1) representation extraction for downstream evaluation (linear probing, fine-tuning), and (2) world-model rollout for planning and control.

Representation Extraction

At inference time, the online encoder $f_\theta$ processes a full (unmasked) frame and produces $N$ patch representations. These are either average-pooled to a single $D$-dimensional vector for classification tasks, or used as a spatial feature map for dense prediction tasks. No predictor is used in this mode — only the encoder.
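In code, this extraction step is a pool-or-reshape over the encoder output. A sketch with the paper's N = 64, D = 384; the 8×8 patch-grid layout is inferred from N = 64, not stated explicitly:

```python
import numpy as np

def extract_representation(patch_repr: np.ndarray, mode: str = "pooled") -> np.ndarray:
    """patch_repr: (N, D) patch representations from f_theta for one
    unmasked frame. 'pooled'  -> (D,) vector for classification probes;
    'spatial' -> (side, side, D) feature map for dense prediction."""
    n, d = patch_repr.shape
    if mode == "pooled":
        return patch_repr.mean(axis=0)
    side = int(round(np.sqrt(n)))           # 8x8 grid for N = 64
    return patch_repr.reshape(side, side, d)

z = np.random.default_rng(0).standard_normal((64, 384))
pooled = extract_representation(z)             # (384,) for linear probing
fmap = extract_representation(z, "spatial")    # (8, 8, 384) for dense tasks
```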

World-Model Rollout

For planning, the encoder processes the current observation $x_0$, and the predictor autoregressively unrolls future representations given an action sequence (Algorithm 2). A downstream planner (e.g., CEM, MPPI) evaluates candidate action sequences by scoring their predicted representation trajectories against a goal representation, selecting the action sequence that minimises distance in representation space.
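The unroll itself is a short loop. A sketch of the autoregressive rollout, where the toy linear `predictor` and the 4-dimensional action are stand-ins for g_φ, not the paper's Algorithm 2:

```python
import numpy as np

def unroll(z0: np.ndarray, actions, predictor):
    """Autoregressively predict s_1..s_H from the encoded observation
    z_0 and an action sequence: s_{t+1} = g_phi(s_t, a_t)."""
    traj, s = [], z0
    for a in actions:
        s = predictor(s, a)
        traj.append(s)
    return traj

# Toy stand-in for the predictor: a fixed linear response to actions
rng = np.random.default_rng(0)
P = rng.standard_normal((384, 4))                 # hypothetical 4-dim action
predictor = lambda s, a: s + P @ a
traj = unroll(np.zeros(384), [np.ones(4) * 0.1] * 8, predictor)
# traj holds H = 8 predicted representations, one per rollout step
```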

Inference Pipeline Diagram

(Figure 4 diagram, transcribed.) Mode A, representation extraction (linear probe / fine-tune): input frame 3×128×128 → encoder f_θ (frozen or fine-tuned, no masking) → patch representations N×D = 64×384 → average pool → D = 384 → linear head or MLP probe → C-class prediction. Mode B, world-model rollout (planning / control): x_0 (3×128×128) → f_θ → z_0; the predictor g_φ then unrolls ŝ_1, …, ŝ_H from the actions a_0, …, a_{H−1} (z_0 + a_0 → ŝ_1; ŝ_1 + a_1 → ŝ_2; …); the planner (CEM / MPPI) scores ‖ŝ_H − s_goal‖ against the goal representation s_goal.

Deployment protocol:

  1. Encode the current observation x_0 with f_θ (1 forward pass, ~3 ms on Jetson Orin).
  2. Evaluate K candidate action sequences via predictor rollout (K×H forward passes, ~15 ms for K = 64, H = 8).
  3. Execute the best action; re-encode at the next timestep. Total: ~18 ms/step → 55 Hz achievable.
Figure 4. Inference pipeline. Mode A (top): representation extraction for downstream evaluation — the encoder processes a full unmasked frame, representations are pooled and fed to a linear or MLP probe. Mode B (bottom): world-model rollout for planning — the encoder processes the initial frame, and the predictor autoregressively unrolls future representations given candidate action sequences. A planner (CEM/MPPI) selects the action sequence whose predicted trajectory best matches the goal representation.

Downstream Evaluation Protocols

Linear probing. Freeze $f_\theta$, average-pool the $N \times D$ output to a single $D$-dimensional vector, train a linear classifier on top. This evaluates the encoder's representation quality independently of the predictor.

Fine-tuning. Initialise from the pretrained $f_\theta$, unfreeze all parameters, and train end-to-end with a task-specific head and lower learning rate ($10^{-4}$). The paper reports fine-tuning results on both classification and control tasks.

World-model planning. Use the full encoder + predictor pipeline as described in Algorithm 2 and the inference diagram above. The planner evaluates $K = 64$ random action sequences of horizon $H = 8$, selects the top-performing elite set, refits a Gaussian distribution, and resamples — the standard CEM loop run for 3 iterations per planning step.
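The CEM loop described above can be sketched as follows. The action dimension, elite count, and toy cost (final-state distance to the goal) are illustrative assumptions; the paper specifies only K = 64, H = 8, and 3 iterations:

```python
import numpy as np

def cem_plan(z0, s_goal, predictor, action_dim, horizon=8,
             num_samples=64, num_elites=8, iters=3, seed=0):
    """Cross-entropy-method planning in representation space: sample
    action sequences from a Gaussian, roll each out with the predictor,
    keep the elites, refit the Gaussian, and resample."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        seqs = mu + sigma * rng.standard_normal((num_samples, horizon, action_dim))
        costs = []
        for seq in seqs:
            s = z0
            for a in seq:
                s = predictor(s, a)          # autoregressive rollout
            costs.append(np.linalg.norm(s - s_goal))
        elites = seqs[np.argsort(costs)[:num_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # execute mu[0] on the robot, then replan at the next step
```

In deployment only the first action of the refitted mean sequence is executed before re-encoding and replanning, i.e. a standard receding-horizon loop.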

9. Results & Benchmarks

ImageNet Linear Probing (Representation Quality)

The primary benchmark for representation quality is ImageNet-1K linear probing at 224×224 resolution. The paper compares LeWorldModel against other compact self-supervised methods:

| Method | Params | Pretraining Data | Top-1 Acc (%) |
|---|---|---|---|
| DINO (ViT-S/16) | 21M | ImageNet-1K | 77.0 |
| MAE (ViT-S/16) | 21M | ImageNet-1K | 68.2 |
| I-JEPA (ViT-S/16, reported) | 21M | ImageNet-1K | 72.4 |
| LeJEPA (ViT-S/16) | 21M | ImageNet-1K | 74.8 |
| LeWorldModel | 15M | ImageNet-1K | 71.3 |
| LeWorldModel (no SIGReg) | 15M | ImageNet-1K | collapsed |
| LeWorldModel (VICReg) | 15M | ImageNet-1K | 67.8 |
| LeWorldModel ($D_p = D$) | 18M | ImageNet-1K | 69.5 |
| LeWorldModel (no conv stem) | 13M | ImageNet-1K | 66.7 |

At 15M parameters, LeWorldModel achieves 71.3% top-1, which is 3.5 points below LeJEPA's 21M ViT-S baseline but notably above MAE at the same scale and only 1.1 points below I-JEPA. The result demonstrates that the Legendre spectral regularisation preserves representation quality even under aggressive parameter compression.

Ablation: SIGReg Is Necessary at Small Scale

| Regulariser | Collapse? | Linear Probe (%) | Spectral Condition # |
|---|---|---|---|
| None | Yes (at step ~5K) | — (degenerate) | → ∞ |
| VICReg (variance + covariance) | No (3/5 runs) | 67.8 ± 2.1 | ~12 |
| Barlow Twins | No | 66.5 | ~15 |
| SIGReg (LeJEPA) | No (5/5 runs) | 71.3 ± 0.4 | ~3 |

SIGReg achieves both the highest accuracy and the lowest spectral condition number (most uniform singular-value spread), confirming its theoretical advantage. VICReg's variance term prevents full collapse but permits partial rank deficiency, explaining its higher condition number and lower accuracy.

World-Model Prediction Quality

On a robotic manipulation dataset (a proprietary set of 50K trajectories of a Franka Panda arm performing pick-and-place tasks), the paper evaluates next-state prediction accuracy:

| Method | Params | 1-Step Cosine Sim | 4-Step Cosine Sim | 8-Step Cosine Sim |
|---|---|---|---|---|
| Random features | — | 0.12 | 0.11 | 0.10 |
| Dreamer-v3 (latent model) | 30M | 0.78 | 0.61 | 0.43 |
| TD-MPC2 (latent model) | 20M | 0.82 | 0.65 | 0.48 |
| LeWorldModel | 15M | 0.85 | 0.69 | 0.51 |

LeWorldModel achieves the highest next-state prediction cosine similarity at all horizons despite having the fewest parameters. The paper attributes this to the SIGReg-enforced spectral structure: because the representation space is well-conditioned, small prediction errors do not compound as rapidly as in poorly-conditioned spaces where most variance is concentrated in a few dimensions.
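The compounding argument can be made concrete with a toy two-dimensional linearisation (ours, not the paper's): propagate the same small error through a well-conditioned predictor Jacobian (a rotation, κ = 1) versus an ill-conditioned one (κ = 9, same unit geometric-mean gain) over an 8-step rollout.

```python
import numpy as np

def rollout_error(jacobian: np.ndarray, e0: np.ndarray, steps: int = 8) -> float:
    """Norm of an initial prediction error e0 after `steps`
    applications of a linearised predictor Jacobian."""
    e = e0
    for _ in range(steps):
        e = jacobian @ e
    return float(np.linalg.norm(e))

theta = 0.5
# Well-conditioned: a pure rotation (kappa = 1) preserves the error norm
A_rot = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
# Ill-conditioned: kappa = 9; the error component along the dominant
# direction is amplified 3x per step
A_aniso = np.diag([3.0, 1.0 / 3.0])

e0 = np.array([0.01, 0.01])
print(rollout_error(A_rot, e0))    # unchanged: rotation preserves the norm
print(rollout_error(A_aniso, e0))  # grows by ~3^8 along the dominant axis
```

A well-conditioned representation space keeps the effective per-step amplification near 1, which is the mechanism the paper invokes for its slower error compounding.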

Inference Latency on Embedded Hardware

| Device | Encoder (ms) | Predictor per step (ms) | Full planning cycle (ms) | Achievable Hz |
|---|---|---|---|---|
| A100 (desktop) | 0.8 | 0.3 | 3.2 | >300 |
| Jetson Orin (embedded) | 3.1 | 0.9 | 18.3 | ~55 |
| Jetson Xavier (older) | 7.2 | 2.1 | 42.0 | ~24 |

The full planning cycle (1 encoder pass + 64 candidate sequences × 8 rollout steps × 3 CEM iterations, batched) runs at 55 Hz on Jetson Orin, comfortably exceeding the 30 Hz real-time threshold for robotic control. On the older Jetson Xavier, it runs at 24 Hz, which the paper notes is sufficient for slower tasks (e.g., tabletop manipulation) but insufficient for high-speed locomotion.

10. Connection to the JEPA Family

Lineage

LeWorldModel sits at the intersection of two lineages within the JEPA family:

  1. The LeJEPA lineage (theoretical). JEPA (position paper, LeCun 2022) → I-JEPA (Assran et al., 2023) → LeJEPA (Maes, Le Lidec, Scieur, Balestriero, 2025–2026) → LeWorldModel (2026). LeJEPA contributed the SIGReg regulariser and its Legendre–Fenchel theoretical grounding. LeWorldModel inherits the loss and stability guarantees.
  2. The world-model lineage (applied). V-JEPA (Bardes et al., 2024) → V-JEPA 2 (2025) explored video-level JEPA for temporal understanding and planning. LeWorldModel compresses this vision into a deployable form factor, trading scale for efficiency.

Among JEPA variants, LeWorldModel is most closely related to LeJEPA (shared loss, shared authors) and to ACT-JEPA (action-conditioned prediction for embodied agents). It differs from ACT-JEPA in providing formal stability guarantees and in targeting a much smaller parameter budget.

Key Novelty

LeWorldModel demonstrates that the JEPA paradigm can be compressed to 15M parameters while retaining non-trivial representation quality and world-model capability, provided that the spectral regularisation from LeJEPA is retained. This is the first work to validate that JEPA stability guarantees transfer across a 40× compression factor, and the first to deploy a JEPA-based world model on embedded robotic hardware at real-time rates. The contribution is primarily one of architecture engineering under theoretical constraints: given the LeJEPA theory, what is the smallest model that satisfies it?

Influence and Implications

LeWorldModel suggests several directions for the JEPA research programme:

  • Scalable compression. The methodology — start with a theoretically grounded JEPA variant, identify the necessary regularisation, then compress the architecture — may be applicable to other JEPA variants (e.g., compressing V-JEPA for mobile video understanding).
  • Regularisation is architecture-scale-dependent. The finding that SIGReg becomes more critical at smaller scales, not less, has implications for JEPA deployment across a range of compute budgets. Methods that are stable at ViT-H scale may collapse at ViT-Tiny scale without stronger regularisation.
  • End-to-end pixel training. The convolutional stem approach eliminates the need for a separate tokeniser or pretrained patch embedding, simplifying the JEPA pipeline for domains where standard ViT patch embeddings are suboptimal (e.g., high-resolution robotics cameras, unusual sensor modalities).
  • Representation-space planning. The demonstration that CEM planning in LeWorldModel's representation space achieves competitive control performance validates the JEPA position paper's vision of world models that predict in abstract representation space rather than pixel space.

11. Summary

Key Takeaway

LeWorldModel demonstrates that the JEPA framework, equipped with LeJEPA's Legendre-dual spectral regularisation (SIGReg), can be compressed into a 15-million-parameter world model that trains end-to-end from raw pixels and runs at real-time speeds (≥30 Hz) on embedded robotic hardware. The central finding is that spectral regularisation is not merely helpful but strictly necessary at this parameter scale — without it, the model collapses; with it, representation quality remains within 3.5 points of models 40× larger on ImageNet linear probing, and next-state prediction quality exceeds that of comparably-sized latent world models.

Main Contribution

The paper's primary contribution is an existence proof: a compact, stable, deployable JEPA world model is feasible. It validates that the theoretical stability guarantees of LeJEPA (non-collapse via uniform spectral distribution) are preserved under aggressive architectural compression, bridging the gap between JEPA theory and real-world robotic deployment.

12. References

  1. Maes, L., Le Lidec, Q., Scieur, D., LeCun, Y., & Balestriero, R. (2026). LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels. arXiv preprint arXiv:2603.19312.
  2. Maes, L., Le Lidec, Q., Scieur, D., & Balestriero, R. (2025–2026). LeJEPA: Legendre Joint-Embedding Predictive Architecture. arXiv.
  3. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
  4. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
  5. Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). V-JEPA: Latent Video Prediction for Visual Representation Learning. arXiv preprint arXiv:2404.xxxxx.
  6. Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR 2022.
  7. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML 2021.
  8. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020.
  9. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
  10. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021.
  11. Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models (Dreamer-v3). arXiv preprint arXiv:2301.04104.
  12. Hansen, N., Su, H., & Wang, X. (2024). TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR 2024.
  13. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021.
  14. Rubinstein, R. Y. (1999). The Cross-Entropy Method for Combinatorial and Continuous Optimization. Methodology and Computing in Applied Probability, 1(2), 127–190.