T-JEPA (Tabular) — World Modeling

At a glance

Problem: Tabular SSL lacks natural augmentations; corrupting or permuting features has no principled, label-preserving analogue to image crops.

Key idea: Augmentation-free JEPA for tables: partition a row's features into context and target subsets and predict the target features' latents.

Modality: Tabular data (rows of heterogeneous features)

Target / masking: A disjoint subset of features (columns) within a row; targets are EMA latents of those features.

Builds on: I-JEPA latent masked prediction adapted to feature subsets.

Used for: Tabular classification and regression representation learning.

Motivation

Self-supervised learning for tabular data is hampered by the absence of natural augmentations. In vision, crops and color jitter produce semantically equivalent views, but corrupting or permuting features in a table has no principled, label-preserving analogue: such transformations can change the meaning of a row. Contrastive tabular SSL therefore relies on ad-hoc, often harmful augmentations. T-JEPA instead aims for augmentation-free tabular pretraining, deriving the self-supervisory signal from the structure of the features themselves rather than from invented views.

How it works

Figure: canonical JEPA schematic for a tabular row. The input is split into a visible context and hidden targets, where the masking unit is a subset of features. The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy, gradient stopped) embeds the targets; the predictor $g_\phi$ maps the context embedding to the target embeddings; training minimizes the latent distance.

For each sample, the set of features is partitioned: a subset forms the context and a disjoint subset forms the target, so the masking unit is a group of features (columns) within a row.
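The partition step can be sketched in a few lines of plain Python. This is a minimal illustration only, assuming a uniformly random split; the function name `partition_features` and the `target_ratio` parameter are hypothetical and not taken from the paper, which may use a different sampling scheme.

```python
import random

def partition_features(num_features, target_ratio=0.3, rng=None):
    """Split column indices into disjoint context and target index lists.

    Assumption: a uniformly random split; target_ratio is a hypothetical knob.
    """
    if num_features < 2:
        raise ValueError("need at least two features to form context and target")
    rng = rng or random.Random()
    # Keep at least one target feature and at least one context feature.
    num_target = min(max(1, round(num_features * target_ratio)), num_features - 1)
    shuffled = rng.sample(range(num_features), num_features)
    target = sorted(shuffled[:num_target])
    context = sorted(shuffled[num_target:])
    return context, target

# A fresh partition would typically be drawn for every row (or batch) during training.
context_idx, target_idx = partition_features(10, target_ratio=0.3, rng=random.Random(0))
```

Every column index lands in exactly one of the two lists, so the context and target never overlap and together cover the whole row.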

A context encoder $f_\theta$ embeds the observed (context) features.
A predictor $g_\phi$, conditioned on which target features are sought, predicts their latent representations.
An EMA target encoder $\bar f_\theta$ embeds the target features to provide the targets, with gradients stopped.

Training minimizes a latent-space distance — no reconstruction of raw feature values, no contrastive negatives, and crucially no data augmentation. Predicting one feature group from another forces the encoder to model inter-feature dependencies directly.
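To make the data flow between the three components concrete, here is a toy forward pass in plain Python. It is a sketch under strong assumptions, not the architecture from the paper: the encoder is a mean-pooled per-feature linear embedding rather than a real network, the predictor simply adds a column-identity embedding as its conditioning, and all names (`init_params`, `encode`, `predict`, `DIM`) are hypothetical.

```python
import random

DIM = 4  # latent width (hypothetical)

def init_params(num_features, rng):
    # One value-projection vector and one column-identity vector per feature.
    return {
        "value": [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(num_features)],
        "column": [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(num_features)],
    }

def encode(params, row, indices):
    """Toy encoder: embed each selected feature, then mean-pool into one latent."""
    tokens = [
        [row[j] * v + c for v, c in zip(params["value"][j], params["column"][j])]
        for j in indices
    ]
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def predict(params, context_latent, target_indices):
    """Toy predictor: condition on which columns are sought by adding
    the mean of their column-identity embeddings to the context latent."""
    cond = [
        sum(col) / len(target_indices)
        for col in zip(*(params["column"][j] for j in target_indices))
    ]
    return [z + c for z, c in zip(context_latent, cond)]

rng = random.Random(0)
online = init_params(5, rng)
# The target encoder starts as a copy of the online encoder and is later updated by EMA.
target_net = {k: [vec[:] for vec in v] for k, v in online.items()}

row = [0.2, -1.0, 3.5, 0.0, 1.1]
context_idx, target_idx = [0, 2, 4], [1, 3]

z_context = encode(online, row, context_idx)     # context encoder on visible features
z_pred = predict(online, z_context, target_idx)  # predictor output for the target subset
z_target = encode(target_net, row, target_idx)   # target latent, treated as a constant
```

The training signal is the distance between `z_pred` and `z_target`; raw feature values are never reconstructed, and no second augmented view of the row is created.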

The objective

Let $x_C$ be the context features and $x_T$ the disjoint target features, whose identities are encoded by the conditioning signal $c_T$. The loss is the squared latent $\ell_2$ distance:

$$\mathcal{L} = \big\lVert\, g_\phi\big(f_\theta(x_C),\, c_T\big) - \operatorname{sg}\big[\bar f_\theta(x_T)\big]\,\big\rVert_2^2,$$

with $\operatorname{sg}$ the stop-gradient and the target encoder updated by EMA. The conditioning $c_T$ tells the predictor which features it must infer, so a single model predicts arbitrary held-out feature subsets, capturing the dependency structure among heterogeneous columns without any augmentation.
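The loss and the EMA update can be written out as a small worked example. This is a plain-Python sketch with hypothetical helper names (`latent_l2_loss`, `ema_update`); it assumes the common EMA rule $\bar\theta \leftarrow \tau\,\bar\theta + (1-\tau)\,\theta$ with momentum $\tau$, which the text above implies but does not spell out. Because the sketch has no automatic differentiation, the stop-gradient is implicit: the target latent is just a fixed list of numbers.

```python
def latent_l2_loss(predicted, target):
    """Squared L2 distance between a predicted latent and a fixed target latent."""
    if len(predicted) != len(target):
        raise ValueError("latents must have the same width")
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

def ema_update(target_params, online_params, momentum):
    """Return target parameters moved toward the online parameters by EMA."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]

# Worked example: only the last two coordinates differ, each by 2.
loss = latent_l2_loss([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])  # 0 + 4 + 4 = 8.0

# momentum=0.75 is chosen only so the arithmetic is exact; EMA target encoders
# typically use a momentum much closer to 1.
new_target = ema_update([0.0, 1.0], [1.0, 3.0], momentum=0.75)  # [0.25, 1.5]
```

In an autodiff framework the same idea would be expressed by detaching the target latent before computing the loss and applying the EMA step after each optimizer update.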

Key results & what's novel

T-JEPA brings the joint-embedding predictive paradigm to one of the hardest SSL domains. By predicting the embeddings of masked feature subsets from other features, it captures inter-feature dependencies directly and provides a self-supervisory signal for heterogeneous tabular data without inventing artificial augmentations. It is competitive on downstream tabular classification and regression, and demonstrates that JEPA's masked latent prediction generalizes beyond spatial or sequential data to non-spatial, non-sequential feature sets — extending the family to a structurally distinct modality.

Strengths & limitations

+ Augmentation-free: avoids the label-distorting transforms that plague tabular contrastive SSL.
+ Directly models inter-feature dependencies via column-subset prediction.
+ Generalizes JEPA to non-spatial, non-sequential feature sets.
− Choice of context/target feature partition is a sensitive design lever.
− Heterogeneous mixed-type columns (categorical + numeric) complicate tokenization and embedding.
− A representation learner; gains depend on dataset structure and may be modest where features are weakly dependent.

Connections & references

Builds on: I-JEPA

Paper ↗