Motivation
Self-supervised learning for tabular data is hampered by the absence of natural augmentations. In vision, crops and color jitter produce semantically equivalent views, but corrupting or permuting features in a table has no principled, label-preserving analogue: such transformations can change the meaning of a row. Contrastive tabular SSL therefore relies on ad-hoc, often harmful augmentations. T-JEPA seeks augmentation-free tabular pretraining, deriving the self-supervisory signal from the structure of the features themselves rather than from invented views.
How it works
For each sample, the set of features is partitioned: a subset forms the context and a disjoint subset forms the target, so the masking unit is a group of features (columns) within a row.
- A context encoder $f_\theta$ embeds the observed (context) features.
- A predictor $g_\phi$, conditioned on which target features are sought, predicts their latent representations.
- A target encoder $\bar f_\theta$, maintained as an exponential moving average (EMA) of the context encoder, embeds the target features to provide the prediction targets, with gradients stopped.
Training minimizes a latent-space distance — no reconstruction of raw feature values, no contrastive negatives, and crucially no data augmentation. Predicting one feature group from another forces the encoder to model inter-feature dependencies directly.
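The context/target split described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' masking procedure: the function name `sample_feature_partition`, the fixed subset sizes, and uniform random sampling of columns are all assumptions made for the example.

```python
import random


def sample_feature_partition(num_features, num_context, num_target, rng):
    """Split column indices into disjoint context and target index lists.

    Hypothetical helper: uniform sampling and fixed sizes are assumptions.
    """
    if num_context + num_target > num_features:
        raise ValueError("context and target sizes exceed the number of features")
    # Shuffle all column indices once, then cut two non-overlapping slices.
    order = rng.sample(range(num_features), num_features)
    context_idx = sorted(order[:num_context])
    target_idx = sorted(order[num_context:num_context + num_target])
    return context_idx, target_idx


rng = random.Random(0)
row = [0.3, 1.0, 42.0, 0.0, 7.5, 2.0]  # one table row with six columns
context_idx, target_idx = sample_feature_partition(len(row), 3, 2, rng)

x_context = [row[i] for i in context_idx]  # values passed to the context encoder
x_target = [row[i] for i in target_idx]    # values passed only to the target encoder
```

Because both index lists are cut from a single shuffled ordering, they cannot overlap, which matches the requirement that the target subset be disjoint from the context. The target indices are also what a conditioning signal would need to encode so that the predictor knows which columns to infer.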
The objective
Let $x_C$ be the context features and $x_T$ the disjoint target features, whose identities are given by the conditioning signal $c_T$. The loss is the squared latent $\ell_2$ distance:
$$\mathcal{L} = \big\lVert\, g_\phi\big(f_\theta(x_C),\, c_T\big) - \operatorname{sg}\big[\bar f_\theta(x_T)\big]\,\big\rVert_2^2,$$
with $\operatorname{sg}$ the stop-gradient and the target encoder updated by EMA. The conditioning $c_T$ tells the predictor which features it must infer, so a single model predicts arbitrary held-out feature subsets, capturing the dependency structure among heterogeneous columns without any augmentation.
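As a rough numerical illustration of the two ingredients named here, the sketch below computes the squared $\ell_2$ loss between a predicted and a target embedding, and applies one EMA step to the target parameters. The helper names `latent_l2_loss` and `ema_update`, the flat-list representation of embeddings and parameters, and the decay value are assumptions for the example; plain Python has no autodiff, so the stop-gradient is represented only by treating the target embedding as a constant.

```python
def latent_l2_loss(predicted, target):
    """Squared l2 distance between a predicted and a target embedding.

    `target` plays the role of sg[target_encoder(x_T)]: it is treated as a
    constant, so no gradient would flow through it in a real framework.
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def ema_update(target_params, online_params, decay):
    """Move target-encoder parameters toward the context-encoder parameters."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]


# Toy embeddings standing in for g(f(x_C), c_T) and the target encoder output.
predicted = [0.5, -1.0, 2.0]
target = [0.0, -1.0, 1.0]
loss = latent_l2_loss(predicted, target)  # 0.25 + 0.0 + 1.0

# Toy parameter vectors; decay=0.5 is chosen only to keep the arithmetic simple.
target_params = [1.0, 0.0]
online_params = [0.0, 2.0]
new_target_params = ema_update(target_params, online_params, decay=0.5)
```

In practice the decay is usually set close to 1 so that the target encoder changes slowly; the value 0.5 above is purely for readability and is not taken from the source.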
Key results & what's novel
T-JEPA brings the joint-embedding predictive paradigm to one of the hardest SSL domains. By predicting the embeddings of masked feature subsets from other features, it captures inter-feature dependencies directly and provides a self-supervisory signal for heterogeneous tabular data without inventing artificial augmentations. It is competitive on downstream tabular classification and regression, and demonstrates that JEPA's masked latent prediction generalizes beyond spatial or sequential data to non-spatial, non-sequential feature sets — extending the family to a structurally distinct modality.
Strengths & limitations
- + Augmentation-free: avoids the label-distorting transforms that plague tabular contrastive SSL.
- + Directly models inter-feature dependencies via column-subset prediction.
- + Generalizes JEPA to non-spatial, non-sequential feature sets.
- − Choice of context/target feature partition is a sensitive design lever.
- − Heterogeneous mixed-type columns (categorical + numeric) complicate tokenization and embedding.
- − A representation learner; gains depend on dataset structure and may be modest where features are weakly dependent.