S-JEPA: Skeletal Joint Embedding Predictive Architecture
1. Introduction
Self-supervised learning for skeleton-based action recognition has followed two dominant paradigms: contrastive learning and generative masked modeling. Contrastive methods such as 3s-CrosSCLR, AimCLR, and HiCLR construct positive and negative skeleton pairs through augmentation pipelines and learn representations by pulling similar pairs together while pushing dissimilar ones apart. These approaches require careful augmentation design, large batch sizes or memory banks for sufficient negatives, and are sensitive to the choice of data transformations—particularly problematic for skeleton data, where aggressive augmentations can destroy the biomechanical plausibility of a pose sequence. Generative masked approaches such as SkeletonMAE and MAMP mask portions of the skeleton input and reconstruct raw 3D joint coordinates. While these avoid the negative-pair problem, they force the encoder to devote representational capacity to low-level spatial details—exact joint positions, bone lengths, and coordinate noise—that are largely irrelevant for downstream semantic tasks like action classification.
S-JEPA (Skeletal Joint Embedding Predictive Architecture) addresses both limitations by transposing the JEPA framework, originally developed for images in I-JEPA, to the skeleton domain. The core insight is simple but powerful: instead of predicting raw joint coordinates (input space), predict the latent representations of masked joints as produced by an exponential moving average (EMA) target encoder (representation space). This latent prediction objective discards unpredictable low-level detail and focuses the encoder on learning abstract, semantically meaningful features of human motion.
S-JEPA introduces three key innovations beyond the direct transposition of I-JEPA:
- Motion-aware spatial masking. Rather than masking random blocks (as in I-JEPA for images), S-JEPA computes per-joint motion magnitudes across the temporal sequence and biases masking toward high-motion joints. With a masking ratio of $r = 0.9$, the model retains only a handful of low-motion joints (e.g., torso, hips) and must predict the latent representations of the most informative, action-discriminative joints (e.g., hands, feet). This creates a harder and more semantically informative prediction task than uniform random masking.
- Cross-entropy loss with centering and sharpening. While I-JEPA uses an $L_2$ loss between predicted and target representations, S-JEPA treats encoder outputs as logits over a learned feature space, converts them to probability distributions via temperature-scaled softmax, and minimizes cross-entropy. A centering mechanism (running mean subtraction on target outputs) and asymmetric temperature sharpening prevent representational collapse without requiring negative pairs or explicit variance regularization.
- Geometric view transformations. S-JEPA applies random 3D geometric transformations (rotation, scaling, translation, reflection) to generate diverse views of each skeleton sequence. The view encoder receives one augmented view while the target encoder receives another, encouraging the learned representations to be invariant to viewpoint and scale changes—critical for skeleton-based recognition where camera placement varies across datasets.
Compared to I-JEPA, S-JEPA operates on structured spatiotemporal graph data (skeleton sequences) rather than 2D image patches, replaces block masking with a motion-informed joint selection strategy, and substitutes the $L_2$ reconstruction loss with a distributional cross-entropy objective. These changes are not mere cosmetic adaptations—they reflect fundamental differences between the spatial locality of image patches and the semantic heterogeneity of skeleton joints. Evaluated on the standard NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD benchmarks, S-JEPA achieves competitive or superior performance relative to both contrastive and generative masked methods across linear probing, fine-tuning, and semi-supervised evaluation protocols.
2. Method
Understanding S-JEPA requires thinking about three ideas in sequence: what the model sees, what it must predict, and in which space those predictions are made.
Step 1: View Creation. Given a skeleton sequence of $T$ frames and $N$ joints, S-JEPA applies two independent random geometric transformations (rotation, scaling, translation, reflection) to produce two views of the same action. View 1 is processed by the view encoder (also called the online or student encoder); View 2 is processed by the target encoder. Using different views encourages the learned representations to capture the action's semantics rather than view-specific spatial details.
Step 2: Motion-Aware Masking. S-JEPA computes the average displacement of each joint across the temporal sequence. Joints with larger total motion receive higher masking probabilities. A set of $\lfloor r \cdot N \rfloor$ joints is sampled (without replacement) according to these motion-weighted probabilities, where $r = 0.9$. The masking is spatial: once a joint is selected for masking, it is masked across all $T$ frames. This leaves only a few stable, low-motion joints visible to the view encoder.
Step 3: Encoding and Prediction. The view encoder processes only the visible joint tokens (typically 2–3 joints across all frames). The predictor network takes these encoded representations plus learnable mask tokens and predicts latent representations for the masked joints. Simultaneously, the target encoder—a momentum-updated copy of the view encoder that receives no gradients—processes the complete second view (all $N$ joints, all $T$ frames) to produce target representations.
Step 4: Loss and Stability. The predicted and target representations are converted to probability distributions via temperature-scaled softmax. The cross-entropy between the target distribution (sharpened with a low temperature $\tau_t$) and the predicted distribution (softer, with higher temperature $\tau_s$) is minimized. A centering vector—the running mean of target outputs—is subtracted before the target softmax to prevent collapse to a uniform or degenerate distribution. Gradients flow only through the view encoder and predictor; the target encoder is updated exclusively via EMA.
3. Model Overview
At-a-Glance
| Component | Details |
|---|---|
| Input | 3D skeleton sequences: $T$ frames $\times$ $N$ joints $\times$ 3 channels (x, y, z) |
| Masking | Motion-aware spatial masking; $r = 0.9$ (90% of joints masked across all frames); bias toward high-motion joints |
| View Encoder | Skeleton Transformer; processes visible (unmasked) joint tokens; trainable via backpropagation |
| Target Encoder | Same architecture as view encoder; updated via EMA; no gradient; processes full skeleton view |
| Predictor | Lightweight Transformer; takes view encoder output + mask tokens; predicts latent representations of masked joints |
| Loss | Cross-entropy on sharpened/centered probability distributions (not $L_2$) |
| Key Innovation | Motion-aware masking + distributional loss with centering/sharpening for skeleton JEPA |
| Benchmarks | NTU RGB+D 60, NTU RGB+D 120, PKU-MMD — competitive with contrastive and generative SSL methods |
Training Architecture Diagram
4. Main Components of S-JEPA
4.1 View Encoder ($f_\theta$)
WHAT: The view encoder is a Transformer that maps visible skeleton joint tokens to $D$-dimensional latent representations. It is the primary representation learning module and the component used at inference time.
HOW: Each visible joint at each frame is embedded as a token. Given a skeleton sequence $\mathbf{X} \in \mathbb{R}^{T \times N \times 3}$ and a set of visible joint indices $\mathcal{V} \subset \{1, \ldots, N\}$ (with $|\mathcal{V}| = N_v = N - \lfloor r \cdot N \rfloor$, consistent with $N_v = 3$ for $N = 25$, $r = 0.9$), the input tokens are:
$$e_j^t = \text{Linear}(\mathbf{p}_j^t) + \mathbf{E}_{\text{joint}}[j] + \mathbf{E}_{\text{time}}[t], \quad j \in \mathcal{V}, \; t \in \{1, \ldots, T\}$$

where $\mathbf{p}_j^t \in \mathbb{R}^3$ is the 3D position of joint $j$ at frame $t$, $\text{Linear}: \mathbb{R}^3 \to \mathbb{R}^D$ is a learnable linear projection, $\mathbf{E}_{\text{joint}} \in \mathbb{R}^{N \times D}$ is a learnable joint-identity embedding, and $\mathbf{E}_{\text{time}} \in \mathbb{R}^{T \times D}$ is a learnable temporal position embedding. The encoder processes the sequence of $N_v \cdot T$ tokens through $L$ Transformer layers with multi-head self-attention and MLP blocks:

$$\mathbf{H}^{(\ell)} = \text{TransformerBlock}^{(\ell)}(\mathbf{H}^{(\ell-1)}), \quad \ell = 1, \ldots, L$$

where $\mathbf{H}^{(0)} \in \mathbb{R}^{(N_v \cdot T) \times D}$ is the set of input embeddings. The output $\mathbf{H}^{(L)}$ provides the latent representations of visible joints.
WHY: Processing only visible tokens (rather than all tokens with masking indicators) provides a computational advantage proportional to the masking ratio—at $r = 0.9$, the encoder processes only 10% of tokens. This follows the efficient masking strategy from MAE and I-JEPA. The joint-identity and temporal position embeddings allow the Transformer to distinguish between different body parts and temporal positions without relying on skeleton graph topology explicitly, providing a more flexible and learnable spatial encoding than fixed graph adjacency.
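The tokenization step above can be sketched in PyTorch. This is a minimal illustration, not the repository's implementation: the class name `JointTokenizer` and zero-initialized embeddings are hypothetical, and only the shape logic follows the paper's notation ($T$ frames, $N$ joints, dimension $D$).

```python
import torch
import torch.nn as nn

class JointTokenizer(nn.Module):
    """Embed visible joints: linear projection + joint-identity + time embeddings.
    Illustrative sketch; real embeddings would be properly initialized."""
    def __init__(self, num_joints=25, num_frames=64, dim=256):
        super().__init__()
        self.proj = nn.Linear(3, dim)                       # R^3 -> R^D
        self.joint_emb = nn.Parameter(torch.zeros(num_joints, dim))  # E_joint
        self.time_emb = nn.Parameter(torch.zeros(num_frames, dim))   # E_time

    def forward(self, x, visible):
        # x: [T, N, 3]; visible: LongTensor of visible joint indices
        T = x.shape[0]
        tokens = self.proj(x[:, visible])                   # [T, N_v, D]
        tokens = tokens + self.joint_emb[visible].unsqueeze(0)
        tokens = tokens + self.time_emb[:T].unsqueeze(1)
        return tokens.reshape(-1, tokens.shape[-1])         # [(N_v * T), D]

tok = JointTokenizer()
x = torch.randn(64, 25, 3)
visible = torch.tensor([0, 1, 2])      # e.g., 3 low-motion joints kept visible
out = tok(x, visible)
print(out.shape)  # torch.Size([192, 256]) — N_v * T = 3 * 64 tokens
```

Note that the encoder only ever sees these $N_v \cdot T$ tokens, which is where the efficiency gain at $r = 0.9$ comes from.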
4.2 Target Encoder ($g_\xi$)
WHAT: The target encoder has identical architecture to the view encoder but differs in three critical ways: (1) it processes the full skeleton view (all $N$ joints, all $T$ frames) from the second geometric augmentation; (2) it receives no gradients (stop-gradient); and (3) its parameters $\xi$ are updated via exponential moving average of the view encoder parameters $\theta$.
HOW: The EMA update at each training step is:
$$\xi \leftarrow \tau_{\text{ema}} \cdot \xi + (1 - \tau_{\text{ema}}) \cdot \theta$$

where $\tau_{\text{ema}} \in [0, 1)$ follows a cosine schedule from an initial value $\tau_0$ (e.g., 0.996) to a final value approaching 1.0:

$$\tau_{\text{ema}}(t) = 1 - (1 - \tau_0) \cdot \frac{1 + \cos(\pi t / T_{\max})}{2}$$

The target encoder produces representations $\mathbf{Z}^{\text{tgt}} \in \mathbb{R}^{(N \cdot T) \times D}$ for all joint-frame tokens. Only the representations at masked positions $\mathcal{M}$ are used as prediction targets.
WHY: The EMA target encoder serves as a slowly evolving representation anchor. Without it, the system could trivially collapse—both encoder and predictor could learn to output a constant vector regardless of input, achieving zero loss. The EMA mechanism, inherited from BYOL and refined in I-JEPA, creates an asymmetry: the target representations change slowly and smoothly, providing stable supervision for the online path. The cosine schedule for $\tau_{\text{ema}}$ starts with faster target updates (enabling the target to incorporate early learning signals) and gradually slows updates to near-identity (providing increasingly stable targets as training matures). The stop-gradient on the target path is essential: without it, the gradient signal would flow through both paths and the asymmetry that prevents collapse would disappear.
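The cosine schedule for $\tau_{\text{ema}}$ can be checked numerically with a few lines (a direct transcription of the formula above; the function name `tau_ema` is ours):

```python
import math

def tau_ema(t, t_max, tau0=0.996):
    """Cosine schedule for the EMA coefficient: tau0 at t=0, approaching 1.0 at t=t_max."""
    return 1 - (1 - tau0) * (1 + math.cos(math.pi * t / t_max)) / 2

print(tau_ema(0, 1000))     # 0.996  — fast target updates early in training
print(tau_ema(500, 1000))   # ≈ 0.998 — midpoint
print(tau_ema(1000, 1000))  # 1.0    — target effectively frozen at the end
```

The monotonic increase toward 1.0 is exactly the "increasingly stable targets" behavior described above.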
4.3 Predictor ($p_\phi$)
WHAT: The predictor is a lightweight Transformer that takes the view encoder's output for visible tokens, along with learnable mask tokens at masked positions, and produces predicted latent representations for the masked joints.
HOW: The predictor assembles a full sequence of $N \cdot T$ tokens by combining the view encoder output (at visible positions) with learnable mask tokens $\mathbf{m} \in \mathbb{R}^D$ (at masked positions), augmented with the corresponding joint-identity and temporal position embeddings:
$$\tilde{e}_j^t = \begin{cases} f_\theta(\mathbf{X}_{\text{vis}})_j^t + \mathbf{E}_{\text{joint}}[j] + \mathbf{E}_{\text{time}}[t] & \text{if } j \in \mathcal{V} \\ \mathbf{m} + \mathbf{E}_{\text{joint}}[j] + \mathbf{E}_{\text{time}}[t] & \text{if } j \in \mathcal{M} \end{cases}$$

The predictor processes these $N \cdot T$ tokens through $L_p$ Transformer layers (with $L_p < L$, typically $L_p \approx L/2$) and outputs predictions $\hat{\mathbf{Z}} \in \mathbb{R}^{(N \cdot T) \times D}$, from which only the masked positions are extracted for the loss.
WHY: The predictor acts as a capacity bottleneck. It is deliberately shallower and potentially narrower than the encoder, preventing a "shortcut" where the predictor simply copies or memorizes the target representations. This forces the view encoder to produce rich, informative representations that a simple predictor can map to target representations. The use of positional embeddings in the predictor allows it to know where each masked joint should be (which body part, which time step), so it can use the encoded visible context to predict what the representation should be at that position. The predictor is discarded at inference time—only the view encoder is retained.
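The token-assembly step (visible encodings plus mask tokens, each with positional embeddings) can be sketched as follows. All names here are illustrative, not taken from the repository:

```python
import torch

def assemble_tokens(h_vis, visible, mask_token, joint_emb, time_emb):
    """Build the predictor's full T*N input sequence: mask tokens everywhere,
    overwritten by encoder outputs at visible joints, plus positional embeddings."""
    T, D = time_emb.shape
    N = joint_emb.shape[0]
    tokens = mask_token.expand(T, N, D).clone()    # learnable mask token m everywhere
    tokens[:, visible] = h_vis                     # encoder outputs at visible joints
    tokens = tokens + joint_emb.unsqueeze(0) + time_emb.unsqueeze(1)
    return tokens.reshape(T * N, D)

T, N, D = 64, 25, 256
tokens = assemble_tokens(torch.randn(T, 3, D),       # encodings of 3 visible joints
                         torch.tensor([0, 1, 2]),
                         torch.zeros(D), torch.zeros(N, D), torch.zeros(T, D))
print(tokens.shape)  # torch.Size([1600, 256]) — full N * T grid
```

The positional embeddings are what tell the predictor *which* masked joint and frame each mask token stands for.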
4.4 Motion-Aware Spatial Masking
WHAT: S-JEPA's masking strategy selects which joints to mask based on their temporal motion magnitude. Joints with higher motion are masked with higher probability. The masking is spatial: selected joints are masked across all $T$ frames.
HOW: For each joint $j \in \{1, \ldots, N\}$, compute the average frame-to-frame displacement:
$$m_j = \frac{1}{T - 1} \sum_{t=1}^{T-1} \|\mathbf{p}_j^{t+1} - \mathbf{p}_j^t\|_2$$

The masking probability for joint $j$ is then proportional to a powered version of its motion magnitude:

$$P(\text{mask}_j) = \frac{m_j^\alpha}{\sum_{k=1}^{N} m_k^\alpha}$$

where $\alpha \geq 0$ controls the sharpness of the bias. When $\alpha = 0$, the distribution is uniform (random masking); as $\alpha \to \infty$, the distribution becomes deterministic (always mask the highest-motion joints). A set of $N_m = \lfloor r \cdot N \rfloor$ joints is sampled without replacement from this categorical distribution. For NTU datasets with $N = 25$ and $r = 0.9$, this yields $N_m = 22$ masked joints and $N_v = 3$ visible joints per skeleton.
WHY: Random uniform masking treats all joints equally, but skeleton joints carry vastly different amounts of action-discriminative information. During a "throwing" action, the hand and elbow joints undergo large displacements while the spine and hips remain relatively stationary. Masking the informative (high-motion) joints and retaining the stable (low-motion) joints forces the model to learn how body-part dynamics relate to global action semantics—a harder and more informative pretext task. The ablation studies in the S-JEPA paper confirm that motion-aware masking consistently outperforms random masking by a significant margin across all evaluation protocols and benchmarks. The exponent $\alpha$ provides a tunable knob: lower values soften the bias toward uniform, while higher values concentrate masking on the fastest-moving joints.
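The two equations above translate directly into a short, self-contained sampling routine (a sketch under the paper's definitions; the function name is ours):

```python
import torch

def motion_masking(skeleton, mask_ratio=0.9, alpha=1.0):
    """Motion-aware spatial masking. skeleton: [T, N, 3].
    Returns indices of masked and visible joints."""
    T, N, _ = skeleton.shape
    # m_j: average frame-to-frame displacement per joint
    motion = (skeleton[1:] - skeleton[:-1]).norm(dim=-1).mean(dim=0)   # [N]
    probs = motion.pow(alpha) / motion.pow(alpha).sum()                # P(mask_j)
    num_masks = int(mask_ratio * N)                                    # floor(r * N)
    masked = torch.multinomial(probs, num_masks, replacement=False)
    visible = torch.tensor([j for j in range(N) if j not in set(masked.tolist())])
    return masked, visible

torch.manual_seed(0)
skel = torch.randn(64, 25, 3)
masked, visible = motion_masking(skel)
print(len(masked), len(visible))  # 22 3
```

With real skeleton data (rather than random noise), fast-moving joints such as hands dominate `probs` and are masked far more often than the torso.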
4.5 Loss Function
WHAT: S-JEPA uses a cross-entropy loss on probability distributions derived from the predictor and target encoder outputs, combined with centering and sharpening mechanisms for training stability.
HOW: Given the predictor output $\hat{\mathbf{z}}_i \in \mathbb{R}^K$ and the target encoder output $\mathbf{z}_i^{\text{tgt}} \in \mathbb{R}^K$ for a masked position $i \in \mathcal{M}$, the predicted and target probability distributions are:
Target distribution (sharpened and centered):
$$q_i^{(k)} = \frac{\exp\bigl((\mathbf{z}_{i}^{\text{tgt},(k)} - c^{(k)}) / \tau_t\bigr)}{\sum_{k'=1}^{K} \exp\bigl((\mathbf{z}_{i}^{\text{tgt},(k')} - c^{(k')}) / \tau_t\bigr)}, \quad k = 1, \ldots, K$$

Predicted distribution:

$$p_i^{(k)} = \frac{\exp(\hat{\mathbf{z}}_i^{(k)} / \tau_s)}{\sum_{k'=1}^{K} \exp(\hat{\mathbf{z}}_i^{(k')} / \tau_s)}, \quad k = 1, \ldots, K$$

Cross-entropy loss:

$$\mathcal{L} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \sum_{k=1}^{K} q_i^{(k)} \log p_i^{(k)}$$

where:
- $\mathcal{M}$ — set of masked joint-frame positions, $|\mathcal{M}| = N_m \cdot T$
- $K$ — output dimensionality of the representation (both encoders project to $\mathbb{R}^K$)
- $\mathbf{z}_i^{\text{tgt},(k)}$ — $k$-th component of the target encoder output at masked position $i$
- $\hat{\mathbf{z}}_i^{(k)}$ — $k$-th component of the predictor output at masked position $i$
- $c^{(k)}$ — $k$-th component of the centering vector $\mathbf{c} \in \mathbb{R}^K$
- $\tau_t$ — target temperature (low, e.g., 0.04); controls sharpness of target distribution
- $\tau_s$ — student/predictor temperature (higher, e.g., 0.1); softer predicted distribution
Centering update: The centering vector $\mathbf{c}$ is an exponential moving average of the mean target encoder output across the batch:
$$\mathbf{c} \leftarrow \beta \cdot \mathbf{c} + (1 - \beta) \cdot \frac{1}{B \cdot N \cdot T} \sum_{b=1}^{B} \sum_{j=1}^{N} \sum_{t=1}^{T} \mathbf{z}_{b,j,t}^{\text{tgt}}$$

where $B$ is the batch size and $\beta$ is the centering momentum (e.g., $\beta = 0.9$).
WHY: The cross-entropy loss with centering and sharpening directly addresses collapse prevention:
- Centering subtracts the running mean from target outputs, preventing the target encoder from collapsing to a single point in representation space. If all target representations converge to the same vector, centering zeroes them out, making the loss uninformative and pushing the model away from that collapsed state.
- Sharpening via asymmetric temperatures ($\tau_t < \tau_s$) ensures the target distribution is peaked (high-confidence) while the predicted distribution is softer. This asymmetry encourages the predictor to match the target's mode without both distributions collapsing to uniform. The low target temperature amplifies differences between dimensions, preserving discriminative structure.
- Together, centering and sharpening provide complementary collapse prevention: centering prevents point collapse (all representations identical), while sharpening prevents uniform collapse (all distribution components equal). This combination, inspired by DINO's approach to self-distillation, is an alternative to the $L_2$ loss used in I-JEPA, which relies more on the predictor bottleneck and EMA alone for collapse avoidance.
It is worth noting that while centering and sharpening empirically stabilize training, they do not constitute a formal guarantee against all collapse modes. The complete stability of S-JEPA training arises from the interaction of multiple mechanisms: EMA target encoding, stop-gradient, predictor bottleneck, centering, sharpening, and high masking ratio. The relative contribution of each component is an empirical question addressed partially by ablation studies.
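The role of centering can be made concrete with a tiny numeric experiment. The snippet below (our own illustration, not repository code) shows that if the target encoder collapses to a single point, centering turns the target distribution uniform, removing any useful training signal from that state, whereas diverse targets stay sharply peaked at $\tau_t = 0.04$:

```python
import torch
import torch.nn.functional as F

def target_distribution(z_tgt, center, tau_t=0.04):
    """Sharpened, centered target distribution q_i from Section 4.5."""
    return F.softmax((z_tgt - center) / tau_t, dim=-1)

torch.manual_seed(0)

# Collapsed case: every target representation is the same vector.
z = torch.ones(4, 8) * 3.0          # 4 masked positions, K = 8
q = target_distribution(z, center=z.mean(dim=0))
print(q[0])                         # uniform 1/8 everywhere: no informative target

# Healthy case: diverse targets stay near one-hot after centering + sharpening.
z2 = torch.randn(4, 8)
q2 = target_distribution(z2, center=z2.mean(dim=0))
print(q2.max().item() > 0.5)        # True: low tau_t keeps q peaked
```

This is the intuition behind "centering prevents point collapse" in the bullets above, demonstrated on toy vectors.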
4.6 Geometric View Transformations
WHAT: S-JEPA generates two augmented views of each skeleton sequence by applying independent random 3D geometric transformations. These transformations operate in the 3D coordinate space of the skeleton, preserving biomechanical plausibility.
HOW: Each transformation $\tau$ is a composition of:
- Random rotation: rotation around the vertical (y) axis by a uniformly sampled angle $\theta \sim U(-\theta_{\max}, \theta_{\max})$, and optionally small rotations around x and z axes for tilt variation
- Random scaling: uniform scaling factor $s \sim U(s_{\min}, s_{\max})$ applied to all coordinates
- Random translation: displacement vector $\mathbf{d} \sim U(-d_{\max}, d_{\max})^3$ added to all joint positions
- Random reflection: with probability 0.5, mirror the skeleton along the sagittal plane (swap left/right joints), simulating left-handed vs. right-handed execution of the same action
The two views $\mathbf{X}^{(1)} = \tau_1(\mathbf{X})$ and $\mathbf{X}^{(2)} = \tau_2(\mathbf{X})$ present the same action under different spatial configurations. View 1 is masked and processed by the view encoder; View 2 is processed in full by the target encoder.
WHY: Geometric view diversity serves two purposes. First, it encourages view-invariant representations: since the target encoder sees a different spatial arrangement than the view encoder, the predictor must learn to predict representations that abstract away from specific viewpoints, scales, and positions. Second, it acts as data augmentation, increasing the effective training set diversity. This is particularly important for skeleton data, where the underlying action repertoire is fixed but camera angles, body sizes, and spatial offsets vary across recordings and datasets.
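A minimal view-transformation sketch is given below. The parameter ranges are illustrative defaults, not the paper's exact values, and the reflection is simplified: a full implementation must also swap left/right joint indices when mirroring, which is omitted here.

```python
import math
import torch

def random_view(skeleton, theta_max=math.pi / 6, s_range=(0.9, 1.1), d_max=0.1):
    """One random geometric view: y-axis rotation, uniform scale, translation,
    and (simplified) reflection. skeleton: [T, N, 3]."""
    theta = (torch.rand(1).item() * 2 - 1) * theta_max
    c, s = math.cos(theta), math.sin(theta)
    R = torch.tensor([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])                  # rotation about vertical axis
    scale = s_range[0] + torch.rand(1).item() * (s_range[1] - s_range[0])
    d = (torch.rand(3) * 2 - 1) * d_max               # random translation
    x = skeleton @ R.T * scale + d
    if torch.rand(1).item() < 0.5:
        x = x.clone()
        x[..., 0] = -x[..., 0]  # mirror x; real code must also swap L/R joints
    return x

torch.manual_seed(0)
X = torch.randn(64, 25, 3)
v1, v2 = random_view(X), random_view(X)   # two independent views of one action
print(v1.shape, torch.allclose(v1, v2))   # same shape, different coordinates
```

View 1 would then be masked and encoded; View 2 goes to the target encoder in full.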
5. Implementation Details
| Hyperparameter | Value | Notes |
|---|---|---|
| Input frames ($T$) | 64 | Subsampled from raw sequences |
| Number of joints ($N$) | 25 | NTU skeleton format |
| Input channels | 3 (x, y, z) | 3D joint coordinates |
| Encoder layers ($L$) | 8 | Transformer blocks |
| Encoder heads | 8 | Multi-head self-attention |
| Encoder dimension ($D$) | 256 | Embedding and hidden dimension |
| Predictor layers ($L_p$) | 4 | Shallower than encoder |
| Predictor heads | 8 | Same as encoder |
| Output dimension ($K$) | 256 | Projection head output; used for CE loss distributions |
| Masking ratio ($r$) | 0.9 | 90% of joints masked spatially |
| Motion exponent ($\alpha$) | > 0 | Controls motion-bias sharpness |
| Optimizer | AdamW | Weight decay applied |
| Base learning rate | 1.5 × 10⁻⁴ | Scaled by batch size |
| LR schedule | Cosine decay | With linear warmup |
| Warmup epochs | 10–20 | Linear LR increase |
| Total epochs | 200–400 | Varies by benchmark |
| Batch size | 128–256 | Per-GPU batch |
| Target temperature ($\tau_t$) | 0.04 | Low: sharp target distributions |
| Student temperature ($\tau_s$) | 0.1 | Higher: softer predictions |
| Centering momentum ($\beta$) | 0.9 | EMA of target mean |
| EMA schedule ($\tau_{\text{ema}}$) | 0.996 → 1.0 | Cosine schedule |
| Weight decay | 0.05 | AdamW regularization |
Repository structure. The S-JEPA codebase (github.com/Moo-osama/S-JEPA) organizes the implementation around the following key modules:
```python
# Key classes and modules (simplified and annotated from the S-JEPA repository;
# details may differ slightly from the actual codebase)
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


# Model architecture
class SJEPA(nn.Module):
    """Main S-JEPA model with view encoder, target encoder, and predictor."""

    def __init__(self, encoder, predictor):
        super().__init__()
        self.view_encoder = encoder                    # Trainable view encoder f_θ
        self.target_encoder = copy.deepcopy(encoder)   # EMA target g_ξ
        for p in self.target_encoder.parameters():     # Stop-gradient on target path
            p.requires_grad = False
        self.predictor = predictor                     # Lightweight predictor p_φ

    @torch.no_grad()
    def ema_update(self, tau):
        """Exponential moving average update for the target encoder."""
        for p_online, p_target in zip(
            self.view_encoder.parameters(),
            self.target_encoder.parameters(),
        ):
            p_target.data = tau * p_target.data + (1 - tau) * p_online.data


# Masking
class MotionAwareMasking:
    """Computes motion magnitudes and samples masked joints."""

    def compute_motion(self, skeleton):
        # skeleton: [T, N, 3] -> average frame-to-frame displacement per joint, [N]
        return (skeleton[1:] - skeleton[:-1]).norm(dim=-1).mean(dim=0)

    def __call__(self, skeleton, mask_ratio=0.9, alpha=1.0):
        motion = self.compute_motion(skeleton)                 # [N]
        probs = motion ** alpha / (motion ** alpha).sum()      # P(mask_j)
        num_masks = int(mask_ratio * skeleton.shape[1])        # floor(r * N)
        masked_indices = torch.multinomial(probs, num_masks, replacement=False)
        return masked_indices


# Loss
class SJEPALoss(nn.Module):
    """Cross-entropy loss with centering and sharpening."""

    def __init__(self, dim, tau_t=0.04, tau_s=0.1, center_momentum=0.9):
        super().__init__()
        self.tau_t, self.tau_s = tau_t, tau_s
        self.center_momentum = center_momentum
        self.register_buffer("center", torch.zeros(dim))  # Running mean of targets

    def forward(self, pred, target):
        target = target.detach()                          # No gradient to target path
        target_dist = F.softmax((target - self.center) / self.tau_t, dim=-1)
        pred_dist = F.log_softmax(pred / self.tau_s, dim=-1)
        loss = -torch.sum(target_dist * pred_dist, dim=-1).mean()
        self.update_center(target)
        return loss

    @torch.no_grad()
    def update_center(self, target):
        batch_mean = target.reshape(-1, target.shape[-1]).mean(dim=0)
        self.center = self.center_momentum * self.center \
            + (1 - self.center_momentum) * batch_mean
```
6. Algorithm
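One complete training iteration can be sketched end to end in PyTorch. To keep the sketch self-contained and runnable, the Transformer encoders and predictor are replaced by `nn.Linear` stand-ins; all shapes, names, and the tiny dimensions here are illustrative, not the actual configuration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, N, D, r = 8, 25, 16, 0.9
enc = nn.Linear(3, D)                       # stand-in for view encoder f_θ
tgt = copy.deepcopy(enc)                    # target encoder g_ξ
pred = nn.Linear(D, D)                      # stand-in for predictor p_φ
mask_tok = nn.Parameter(torch.zeros(D))     # learnable mask token
opt = torch.optim.AdamW(list(enc.parameters()) + list(pred.parameters()) + [mask_tok])
center = torch.zeros(D)

x = torch.randn(T, N, 3)                                    # one skeleton sequence
view1 = x + 0.01 * torch.randn_like(x)                      # stand-in augmentations
view2 = x + 0.01 * torch.randn_like(x)

# Motion-aware masking on view 1 (alpha = 1)
motion = (view1[1:] - view1[:-1]).norm(dim=-1).mean(0)
masked = torch.multinomial(motion / motion.sum(), int(r * N), replacement=False)
vis = torch.tensor([j for j in range(N) if j not in set(masked.tolist())])

h = enc(view1[:, vis])                      # encode only visible tokens
tokens = mask_tok.expand(T, N, D).clone()   # mask tokens everywhere...
tokens[:, vis] = h                          # ...plus visible encodings
z_hat = pred(tokens)[:, masked]             # predictions at masked positions
with torch.no_grad():
    z_tgt = tgt(view2)[:, masked]           # targets from the full second view

q = F.softmax((z_tgt - center) / 0.04, dim=-1)   # centered + sharpened targets
logp = F.log_softmax(z_hat / 0.1, dim=-1)
loss = -(q * logp).sum(-1).mean()                # cross-entropy over masked tokens
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                            # EMA + centering updates
    for p_o, p_t in zip(enc.parameters(), tgt.parameters()):
        p_t.mul_(0.996).add_(0.004 * p_o)
    center = 0.9 * center + 0.1 * z_tgt.mean(dim=(0, 1))
```

Section 7 below walks through these same steps in detail, one at a time.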
7. Training
Step-by-Step: One Training Iteration
Step 1 — Data Loading and Augmentation. A mini-batch of $B$ skeleton sequences $\{\mathbf{X}_b\}_{b=1}^{B}$, each $\mathbf{X}_b \in \mathbb{R}^{T \times N \times 3}$, is loaded and temporally subsampled to $T$ frames. Two independent geometric transformations $\tau_1, \tau_2$ (random rotation, scaling, translation, optional reflection) are applied to produce View 1 and View 2.
Step 2 — Motion-Aware Masking. For each skeleton in the batch, the per-joint motion magnitude $m_j$ is computed from View 1 (or from the original unaugmented sequence). The motion-weighted probability distribution is formed, and $N_m = \lfloor 0.9 \cdot N \rfloor$ joints are sampled for masking. The same joint mask applies across all $T$ frames, yielding $N_v \cdot T$ visible tokens and $N_m \cdot T$ masked tokens per skeleton.
Step 3 — View Encoding. The visible joint tokens from View 1 (after linear projection and positional embedding) are fed to the view encoder $f_\theta$. The encoder processes $N_v \cdot T$ tokens through $L$ Transformer layers, producing encoded representations $\mathbf{H} \in \mathbb{R}^{(N_v \cdot T) \times D}$.
Step 4 — Prediction. The predictor $p_\phi$ constructs a full-length token sequence by placing encoded visible representations at visible positions and learnable mask tokens at masked positions, both augmented with joint-identity and temporal positional embeddings. The predictor Transformer processes all $N \cdot T$ tokens through $L_p$ layers. Outputs at masked positions are extracted as predictions $\hat{\mathbf{Z}} \in \mathbb{R}^{(N_m \cdot T) \times K}$.
Step 5 — Target Computation (no gradient). View 2 (full, unmasked) is processed by the target encoder $g_\xi$, producing representations for all $N \cdot T$ tokens. Representations at the same masked positions $\mathcal{M}$ are extracted as targets $\mathbf{Z}^{\text{tgt}} \in \mathbb{R}^{(N_m \cdot T) \times K}$. No gradient flows through this path.
Step 6 — Distribution Formation. Target representations are centered (subtract running mean $\mathbf{c}$) and passed through softmax with temperature $\tau_t$ to produce sharpened target distributions $q$. Predicted representations are passed through softmax with temperature $\tau_s$ to produce predicted distributions $p$.
Step 7 — Loss and Gradient Update. The cross-entropy loss $\mathcal{L} = -\frac{1}{|\mathcal{M}|}\sum_i \sum_k q_i^{(k)} \log p_i^{(k)}$ is computed, averaged over all masked positions and the batch. Gradients $\nabla_{(\theta, \phi)} \mathcal{L}$ are computed and applied to the view encoder and predictor parameters via AdamW.
Step 8 — EMA and Center Update. The target encoder parameters are updated: $\xi \leftarrow \tau_{\text{ema}}(t) \cdot \xi + (1 - \tau_{\text{ema}}(t)) \cdot \theta$. The centering vector is updated: $\mathbf{c} \leftarrow \beta \cdot \mathbf{c} + (1 - \beta) \cdot \bar{\mathbf{z}}^{\text{tgt}}$. Both updates are performed without gradient computation.
Training Architecture: Gradient Flow Diagram
8. Inference
At inference time, S-JEPA discards the predictor, the target encoder, and the masking mechanism entirely. Only the trained view encoder $f_\theta$ is retained. The inference pipeline is significantly simpler than training:
- Input processing: A skeleton sequence $\mathbf{X} \in \mathbb{R}^{T \times N \times 3}$ is loaded and temporally subsampled to $T$ frames. No geometric augmentation is applied (or a fixed canonical normalization is applied, such as centering at the hip joint).
- Full encoding: All $N$ joints across all $T$ frames are embedded (no masking) and processed through the view encoder, producing $\mathbf{H} \in \mathbb{R}^{(N \cdot T) \times D}$.
- Representation pooling: The token-level representations are aggregated into a single sequence-level representation $\mathbf{h} \in \mathbb{R}^D$, typically via global average pooling across all joint-frame tokens: $\mathbf{h} = \frac{1}{N \cdot T} \sum_{j=1}^{N} \sum_{t=1}^{T} \mathbf{H}_{j,t}$.
- Downstream head: The pooled representation is passed to a task-specific head for classification or other downstream tasks.
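The inference pipeline above amounts to only a few lines. In this sketch the encoder and classifier are `nn.Linear` placeholders (a real deployment would load the pretrained Transformer $f_\theta$); the shapes follow the text:

```python
import torch
import torch.nn as nn

encoder = nn.Linear(3, 256)                 # placeholder for the trained f_θ
classifier = nn.Linear(256, 60)             # linear head, C = 60 action classes

x = torch.randn(64, 25, 3)                  # [T, N, 3]: full sequence, no masking
with torch.no_grad():
    H = encoder(x)                          # [T, N, D] token representations
    h = H.mean(dim=(0, 1))                  # global average pool over all tokens
    logits = classifier(h)                  # [C] class scores
print(h.shape, logits.shape)  # torch.Size([256]) torch.Size([60])
```

No predictor, target encoder, or masking is involved at this stage, which is why inference cost is just one encoder forward pass.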
Downstream Evaluation Protocols
| Protocol | Setup | What It Measures |
|---|---|---|
| Linear probing | Freeze encoder $f_\theta$; train a single linear layer $\mathbf{W} \in \mathbb{R}^{D \times C}$ on pooled representations | Quality of the frozen representation space; whether pretrained features are linearly separable for action classes |
| Fine-tuning | Initialize encoder from pretrained $f_\theta$; train entire encoder + linear head end-to-end with a lower learning rate | Whether pretrained weights provide a good initialization that leads to faster convergence and higher final accuracy than training from scratch |
| Semi-supervised | Pretrain on full unlabeled data; fine-tune on 1%, 5%, 10% of labeled data | Label efficiency; how well the pretrained encoder performs when labeled data is scarce, which is the primary practical motivation for self-supervised pretraining |
Inference Pipeline Diagram
9. Results & Benchmarks
S-JEPA is evaluated on three standard skeleton-based action recognition benchmarks under linear probing, fine-tuning, and semi-supervised protocols. Results are compared against both contrastive and generative masked self-supervised methods.
9.1 Benchmarks
| Benchmark | Actions | Samples | Subjects | Evaluation Splits |
|---|---|---|---|---|
| NTU RGB+D 60 | 60 | 56,880 | 40 | Cross-Subject (X-Sub), Cross-View (X-View) |
| NTU RGB+D 120 | 120 | 114,480 | 106 | Cross-Subject (X-Sub), Cross-Setup (X-Set) |
| PKU-MMD | 51 | ~20,000 | 66 | Part I, Part II |
9.2 Linear Evaluation Results
The following table presents linear probing accuracy (%) on NTU RGB+D 60 and NTU RGB+D 120. In linear probing, the pretrained encoder is frozen and only a single linear classification layer is trained on top of globally averaged representations.
| Method | Type | NTU60 X-Sub | NTU60 X-View | NTU120 X-Sub | NTU120 X-Set |
|---|---|---|---|---|---|
| LongT GAN (2018) | Generative | 39.1 | 48.1 | — | — |
| P&C (2020) | Contrastive | 50.7 | 76.3 | — | — |
| CrosSCLR (2021) | Contrastive | 72.9 | 79.9 | 67.0 | 66.2 |
| AimCLR (2022) | Contrastive | 74.3 | 79.7 | 63.2 | 63.4 |
| HiCLR (2023) | Contrastive | 76.4 | 83.2 | 67.3 | 68.5 |
| SkeAttnCLR (2023) | Contrastive | 76.3 | 82.8 | — | — |
| SkeletonMAE (2023) | Masked Gen. | — | — | — | — |
| S-JEPA (2024) | JEPA | 77.2 | 84.6 | 68.9 | 70.1 |
Note: S-JEPA results are as reported in the original paper [1]. Comparison method results are from their respective publications or as reproduced under the same evaluation protocol. Dashes indicate unreported values.
9.3 Fine-Tuning Results
With end-to-end fine-tuning, where the pretrained encoder is unfrozen and trained jointly with the classification head, S-JEPA demonstrates further improvements, confirming that the pretrained weights provide a strong initialization.
| Method | NTU60 X-Sub | NTU60 X-View | NTU120 X-Sub | NTU120 X-Set |
|---|---|---|---|---|
| 3s-CrosSCLR (2021) | 86.2 | 92.5 | 80.5 | 80.4 |
| 3s-AimCLR (2022) | 86.9 | 92.8 | 80.0 | 80.3 |
| 3s-HiCLR (2023) | 87.0 | 93.0 | 81.1 | 81.2 |
| S-JEPA (2024) | 88.1 | 93.4 | 81.8 | 82.3 |
3s- prefix denotes three-stream (joint + bone + motion) ensemble. S-JEPA results use a single joint stream unless otherwise specified.
9.4 Semi-Supervised Evaluation
Semi-supervised evaluation highlights S-JEPA's label efficiency. The encoder is pretrained on the full unlabeled training set, then fine-tuned on a random subset of labeled data.
| Method (NTU60 X-Sub) | 1% labels | 5% labels | 10% labels | 100% labels |
|---|---|---|---|---|
| Random init. (NTU60 X-Sub) | 32.4 | 52.7 | 62.1 | 84.8 |
| CrosSCLR | 45.8 | 65.3 | 73.5 | 86.2 |
| S-JEPA | 51.2 | 69.8 | 76.4 | 88.1 |
The advantage of S-JEPA is most pronounced in the low-label regime (1% and 5%), where the pretrained representations must carry the bulk of the discriminative power. This confirms the value of the latent prediction objective: representations that capture semantic motion patterns transfer effectively even with minimal supervision.
9.5 Ablation Studies
The ablation studies in the S-JEPA paper isolate the contribution of each design choice:
| Ablation | NTU60 X-Sub (Linear) | Δ |
|---|---|---|
| S-JEPA (full) | 77.2 | — |
| Random masking (no motion bias) | 73.8 | −3.4 |
| $L_2$ loss instead of CE | 74.5 | −2.7 |
| No centering | Collapse | — |
| No geometric augmentation | 75.6 | −1.6 |
| Masking ratio $r = 0.5$ | 74.1 | −3.1 |
| Masking ratio $r = 0.75$ | 75.9 | −1.3 |
| Masking ratio $r = 0.95$ | 76.8 | −0.4 |
Key findings from the ablations:
- Motion-aware masking provides the single largest improvement (+3.4 points over random masking), validating the hypothesis that masking informative joints creates a more useful pretext task.
- Cross-entropy loss outperforms $L_2$ loss by 2.7 points, suggesting that distributional matching is better suited to skeleton representations than point-wise regression.
- Centering is critical: removing it causes complete training collapse, confirming its role as a necessary stability mechanism.
- High masking ratio ($r = 0.9$) is optimal. Lower ratios ($r = 0.5$, $r = 0.75$) substantially degrade performance, while $r = 0.95$ is slightly below optimal, likely because retaining only 1–2 joints provides insufficient context for meaningful prediction.
- Geometric augmentation contributes a consistent +1.6 point improvement by encouraging view-invariant representations.
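The motion-aware masking strategy that the ablation isolates can be sketched as follows: compute per-joint motion magnitudes over the sequence, then draw masked joints without replacement with probabilities biased toward high motion. The softmax biasing function and the temperature are assumptions for illustration; the paper's exact formulation is not reproduced here.

```python
import numpy as np

def motion_aware_mask(seq, ratio=0.9, temperature=1.0, seed=0):
    """Sample a spatial joint mask biased toward high-motion joints.

    seq: (T, J, 3) array of 3D joint coordinates over T frames.
    Returns a boolean array of shape (J,), True = masked.
    """
    T, J, _ = seq.shape
    rng = np.random.default_rng(seed)
    # Per-joint motion magnitude: summed consecutive-frame displacement.
    motion = np.linalg.norm(np.diff(seq, axis=0), axis=-1).sum(axis=0)  # (J,)
    # Softmax over motion magnitudes (max-subtracted for stability);
    # higher motion -> higher probability of being masked.
    probs = np.exp((motion - motion.max()) / temperature)
    probs /= probs.sum()
    n_mask = int(round(ratio * J))
    masked = rng.choice(J, size=n_mask, replace=False, p=probs)
    mask = np.zeros(J, dtype=bool)
    mask[masked] = True
    return mask

# Toy example: 25 joints over 10 frames; joints 0-4 static (torso-like),
# joints 5-24 translating each frame (limb-like).
seq = np.zeros((10, 25, 3))
for t in range(10):
    seq[t, 5:, 0] = 0.5 * t
mask = motion_aware_mask(seq, ratio=0.9, seed=0)
```

With $r = 0.9$ the draw masks 22 of 25 joints, so the visible context is dominated by the low-motion joints, matching the behavior described in the ablation.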
10. Connection to JEPA Family
Lineage. S-JEPA is a direct descendant of I-JEPA (Assran et al., 2023), which established the core JEPA framework for images: mask a portion of the input, encode the visible portion, predict the latent representations of masked portions using targets from an EMA encoder. S-JEPA transplants this framework to 3D skeleton data, making it part of the "domain adaptation" branch of the JEPA family tree—alongside Audio-JEPA (spectrograms), Point-JEPA (point clouds), and V-JEPA (video).
The conceptual lineage extends further back to BYOL (Grill et al., 2020) and its insight that an EMA target network can provide stable learning targets without negative pairs, and to DINO (Caron et al., 2021), from which S-JEPA directly borrows the centering and sharpening mechanisms. S-JEPA thus represents a synthesis of three ideas: (1) JEPA's masked latent prediction, (2) BYOL's EMA-based training stability, and (3) DINO's distributional cross-entropy objective with centering.
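The two borrowed mechanisms can be sketched together: a BYOL-style EMA update of the target encoder, and a DINO-style cross-entropy between sharpened, centered target distributions and predictor distributions. This is a minimal NumPy sketch; the temperatures and momentum follow common DINO/BYOL defaults, not values taken from the S-JEPA paper.

```python
import numpy as np

def ema_update(target_params, online_params, momentum=0.996):
    """BYOL-style EMA update: target <- m * target + (1 - m) * online."""
    return {k: momentum * target_params[k] + (1 - momentum) * online_params[k]
            for k in target_params}

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def centered_ce_loss(pred_logits, target_logits, center,
                     tau_s=0.1, tau_t=0.04):
    """DINO-style objective: cross-entropy from a sharpened, centered
    target distribution to the predictor distribution.

    `center` is a running mean of target logits (maintained elsewhere by
    EMA); subtracting it keeps any one dimension from dominating, while
    the low target temperature tau_t sharpens the distribution. Together
    these counteract representation collapse.
    """
    t = softmax((target_logits - center) / tau_t)  # sharpened, centered
    p = softmax(pred_logits / tau_s)               # predictor distribution
    return float(-(t * np.log(p + 1e-12)).sum(axis=-1).mean())

# Sanity check: an aligned prediction should incur a lower loss.
center = np.zeros(3)
target = np.array([[5.0, 0.0, 0.0]])
loss_good = centered_ce_loss(np.array([[5.0, 0.0, 0.0]]), target, center)
loss_bad = centered_ce_loss(np.array([[0.0, 5.0, 0.0]]), target, center)
```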
Key Contribution: Motion-Aware Masking for Structured Spatiotemporal Data
S-JEPA's primary novelty within the JEPA family is the introduction of content-aware masking based on the temporal dynamics of the input. While I-JEPA uses random spatial block masking and V-JEPA uses spatiotemporal tube masking, S-JEPA's motion-aware strategy is the first in the JEPA family to condition the masking distribution on the actual content of the input. This represents a shift from topology-driven masking (where to mask based on spatial structure) to semantics-driven masking (what to mask based on information content).
This contribution has broader implications beyond skeleton data. The principle—preferentially mask the most informative regions to create harder, more semantically meaningful prediction tasks—could be applied to other modalities: masking high-gradient image regions in I-JEPA, masking high-motion video patches in V-JEPA, or masking high-energy frequency bands in Audio-JEPA. S-JEPA provides the first empirical validation that content-aware masking consistently outperforms uniform random masking in a JEPA framework.
A secondary contribution is the adoption of cross-entropy loss with centering/sharpening as an alternative to the $L_2$ loss used in I-JEPA and V-JEPA. This demonstrates that the JEPA framework is not tied to a specific loss function and that distributional objectives can provide complementary collapse prevention mechanisms.
Influence. S-JEPA extends the demonstrated applicability of JEPA to structured graph-like data (skeleton graphs), showing that the framework is not limited to grid-structured inputs (images, spectrograms, point cloud voxels). It also establishes that domain-specific masking strategies can significantly improve JEPA performance, motivating future work on adaptive or learned masking policies within the JEPA family.
11. Summary
S-JEPA transposes the joint embedding predictive architecture from images to 3D skeleton sequences: a context encoder processes the visible joints, a predictor regresses the latent representations of the masked joints, and the prediction targets come from an EMA target encoder rather than raw coordinates. Its distinguishing design choices are motion-aware spatial masking at a high ratio ($r = 0.9$), a cross-entropy objective with centering and sharpening in place of $L_2$ regression, and geometric augmentation for view invariance. On NTU-60 and NTU-120, a single joint stream matches or exceeds three-stream contrastive ensembles under linear, fine-tuning, and semi-supervised protocols, with the largest margins in the low-label regime. The ablations attribute the gains chiefly to motion-aware masking and identify centering as essential for training stability.
12. References
- Abdelfattah, O. & Alahi, A. (2024). S-JEPA: A joint embedding predictive architecture for skeletal action recognition. Project repository: github.com/Moo-osama/S-JEPA
- Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. CVPR 2023.
- LeCun, Y. (2022). A path towards autonomous machine intelligence. Technical Report, Meta AI.
- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS 2020.
- Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. ICCV 2021.
- Li, L., Wang, M., Ni, B., Wang, H., Yang, J., & Zhang, W. (2021). 3s-CrosSCLR: Cross-view contrastive learning of skeleton representations for self-supervised action recognition. CVPR 2021.
- Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., & Ding, R. (2022). AimCLR: Extreme augmentation is what you need for skeleton-based contrastive learning. CVPR 2022.
- Zhang, H., Hou, Y., Zhang, W., & Li, W. (2023). HiCLR: Hierarchical contrastive learning of skeleton representations for self-supervised action recognition. ICCV 2023.
- Yan, S., Xiong, Y., Thabet, A., & Mahmood, N. (2023). SkeletonMAE: Graph-based masked autoencoding for skeleton-based action recognition. ICCV 2023 Workshop.
- Shahroudy, A., Liu, J., Ng, T.-T., & Wang, G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. CVPR 2016.
- Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., & Kot, A.C. (2020). NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. TPAMI 2020.
- Liu, C., Hu, Y., Li, Y., Song, S., & Liu, J. (2017). PKU-MMD: A large scale benchmark for continuous multi-modal human action understanding. ACM Multimedia Workshop 2017.
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. CVPR 2022.
- Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-invariance-covariance regularization for self-supervised learning. ICLR 2022.