Authors: He, Sakai, Chandhok, Beery, Yuan, Padoy, Hasegawa, Sigal
Date: 2026-03
Category: Vision
Derives from: I-JEPA

DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

Shunsuke He, Toshihiko Sakai, Akash Chandhok, Sara Beery, Jiawei Yuan, Nicolas Padoy, Yoshimitsu Hasegawa, Leonid Sigal — 2026

1. Introduction

Self-supervised visual representation learning via Joint-Embedding Predictive Architectures (JEPA) has established that predicting in latent space—rather than in pixel space—produces representations that emphasize semantic content over low-level texture. I-JEPA (Assran et al., 2023) demonstrated this convincingly on ImageNet: a Vision Transformer encodes visible context patches, a lightweight predictor network fills in the representations of randomly masked target blocks, and an exponential-moving-average (EMA) target encoder provides the regression targets. The approach avoids pixel-level reconstruction artifacts and hand-crafted augmentations while learning features competitive with contrastive and masked-image-modeling baselines.

However, I-JEPA treats all target blocks as equally informative and predicts them in parallel, without regard for their semantic importance. Random block masking is spatially agnostic: the model is equally likely to predict a patch of uniform background as one covering a discriminative object part. Furthermore, because the predictor processes all mask tokens simultaneously through shared self-attention layers, the prediction targets form an unordered set: the training objective is invariant to any permutation of the targets, so the model receives no learning signal about which regions matter most or which should be resolved first.

This limitation is especially damaging for fine-grained recognition tasks where success depends on attending to subtle, spatially localized cues—the beak pattern that distinguishes two warbler species, the grille shape that differentiates car models, or the petal arrangement that identifies a plant genus. When target blocks are sampled uniformly at random, the model spends equal capacity predicting uninformative background regions and critical discriminative regions, diluting the learning signal available for the latter.

DSeq-JEPA (He et al., 2026) addresses these shortcomings with two interlocking modifications to the I-JEPA framework:

  1. Attention-derived saliency masking. Instead of sampling target blocks uniformly, DSeq-JEPA computes a saliency map from the context encoder's self-attention weights, ranks image regions by discriminative importance, and selects the top-$K$ most salient regions as prediction targets. This focuses the predictive objective on semantically rich areas.
  2. Sequential next-region prediction. Rather than predicting all target blocks in parallel, the predictor resolves targets one at a time in descending order of saliency. At each step $t$, the predictor has access to the original context tokens plus the predictions from all prior steps $1, \dots, t{-}1$, creating an autoregressive chain in latent space. This breaks permutation symmetry and imposes a curriculum-like progression from primary to secondary semantic cues.

The combination produces a training signal that is both more focused (discriminative regions first) and more structured (sequential dependencies encode region importance). Empirically, DSeq-JEPA improves over I-JEPA on ImageNet-1K linear probing and, more significantly, delivers substantial gains on fine-grained benchmarks including iNaturalist 2021, CUB-200-2011, and Stanford Cars, as well as on dense prediction tasks (MS-COCO detection/segmentation, ADE20K semantic segmentation).

2. Method

Intuition: The Art Critic Analogy. Imagine you are learning to analyze a painting. A novice might scan the canvas haphazardly, spending equal time on the gilt frame and the subject's face. An expert critic, by contrast, first locks onto the most informative region—the subject's expression, perhaps—and uses that anchor to interpret progressively less obvious details: the hands, the background symbolism, the brushwork. DSeq-JEPA formalizes this strategy for visual self-supervised learning. It uses the model's own attention to identify which regions carry the most information, then predicts those regions in order of decreasing importance. Each prediction feeds into the next, so the model builds a coherent interpretation from the most to the least discriminative parts of the image.

The method can be decomposed into three phases within each training iteration:

Phase 1: Encode and Attend. An input image is patchified and fed through the context encoder (a Vision Transformer). As a byproduct of the forward pass, the encoder produces multi-head self-attention maps. These maps are aggregated across heads and layers to form a spatial saliency map—a heatmap indicating which patches the encoder considers most informative for its internal representations.

Phase 2: Rank and Select. The saliency map is used to rank image regions (groups of contiguous patches) from most to least discriminative. The top-$K$ regions are selected as prediction targets. Unlike I-JEPA's random block sampling, this selection is content-adaptive: images with a centered subject will have targets clustered on the subject, while images with distributed texture may have more spread-out targets.

Phase 3: Predict Sequentially. The predictor network processes the targets in saliency-ranked order. At step 1, it receives only the visible context tokens plus a set of positional mask tokens for the first (most salient) target region and produces a latent prediction. At step 2, the predicted representation from step 1 is appended to the predictor's input alongside the context and the positional mask tokens for the second target region. This continues for all $K$ steps. The EMA target encoder provides ground-truth representations for each target region, and the loss accumulates across all steps.

Intuition: Why Sequential Matters. Consider two target blocks: one covering a bird's head, another covering its tail. In I-JEPA's parallel scheme, the predictor resolves both from context alone, independently. In DSeq-JEPA, the predictor first resolves the head (the most discriminative region), then uses that resolved representation to help predict the tail. This mirrors how parts relate hierarchically—knowing "it's a warbler" (from the head) constrains what the tail should look like. The sequential chain forces the predictor to encode compositional, part-to-whole relationships rather than treating each target in isolation.

Crucially, the saliency computation requires no auxiliary network: the scores are read directly from attention maps produced during an encoder forward pass. (It does, however, require a full-image encoder pass before the context-only pass, since the masking cannot be decided until the attention maps exist; see the training notes in Section 7.) The sequential prediction adds $K{-}1$ additional predictor passes per iteration compared to I-JEPA's single parallel pass, but since the predictor is lightweight (typically 6–12 transformer blocks compared to the encoder's 24+), the wall-clock cost increase is modest.

3. Model Overview

At-a-Glance

| Component | Detail |
|---|---|
| Input | Image patches (e.g., 16×16 or 14×14 pixels per patch) |
| Masking | Attention-derived saliency map; regions ranked by discriminative importance; top-$K$ selected as sequential targets |
| Context Encoder | ViT-L/16 or ViT-H/14 (trainable via gradient descent) |
| Target Encoder | Same architecture as context encoder; updated via EMA (frozen to gradients) |
| Predictor | Narrow transformer (e.g., 12 blocks, 384-dim); processes targets sequentially |
| Loss | Smooth-$\ell_1$ (Huber) loss on predicted vs. target representations, summed across sequential steps |
| Key Result | Consistent improvements over I-JEPA on ImageNet linear probe and substantial gains on fine-grained benchmarks (CUB, iNat21, Stanford Cars) |
| Params (ViT-L/16) | ~307M encoder + ~38M predictor |

Training Architecture Diagram

[Figure 1: training architecture diagram. Shape legend: B = batch, N = 196 patches (14×14 grid for ViT-L/16), D = 1024, K = number of sequential prediction steps.]
Figure 1. DSeq-JEPA training architecture. The context encoder processes visible patches and produces attention-derived saliency maps. Regions are ranked by saliency and fed to the sequential predictor in descending order. Each prediction step conditions on all prior predictions. The EMA target encoder provides ground-truth representations. Gradients flow through the loss to update the predictor and context encoder; the target encoder is updated via EMA.

4. Main Components of DSeq-JEPA

4.1 Context Encoder $f_\theta$

WHAT. The context encoder is a standard Vision Transformer (ViT) that processes only the visible (unmasked) patches of the input image. It produces contextualized token representations and, as a byproduct, multi-head self-attention maps used for saliency computation.

HOW. Following I-JEPA conventions, DSeq-JEPA uses ViT-L/16 (24 layers, 16 heads, $D = 1024$, patch size $16 \times 16$) or ViT-H/14 (32 layers, 16 heads, $D = 1280$, patch size $14 \times 14$) as the encoder backbone. For a $224 \times 224$ image with patch size $16$, the encoder produces $N = 196$ patch tokens (when no masking is applied to the encoder input). In practice, the context encoder receives $N_\text{ctx}$ tokens—those not selected as targets. Standard learnable positional embeddings are added. The output dimensionality matches the ViT embedding dimension $D$.
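The patch-grid arithmetic follows directly from the resolution and patch size; a quick sanity check in plain Python (illustrative only):

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (grid side length, total patch tokens) for a square image."""
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    side = image_size // patch_size
    return side, side * side

# ViT-L/16 on a 224x224 image: 14x14 grid -> 196 patch tokens
assert patch_grid(224, 16) == (14, 196)
# ViT-H/14 on the same image: 16x16 grid -> 256 patch tokens
assert patch_grid(224, 14) == (16, 256)
```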

WHY. The ViT backbone provides two critical affordances: (1) high-quality representations when scaled, and (2) explicit attention maps that can be repurposed for saliency computation without additional modules. The authors leverage property (2) to avoid any auxiliary saliency network, keeping the method architecturally simple. Ablations in the paper compare ViT-B/16, ViT-L/16, and ViT-H/14, showing consistent gains from DSeq-JEPA's sequential strategy across all scales, with the largest absolute improvements at the ViT-L/16 scale.

4.2 Target Encoder $f_\xi$ (EMA)

WHAT. The target encoder is an identical ViT that processes the full (unmasked) image and produces the ground-truth representations against which the predictor's outputs are compared. It is not trained via gradient descent; instead, its parameters $\xi$ are an exponential moving average of the context encoder's parameters $\theta$.

HOW. The EMA update rule is:

$$\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta$$

where $\tau$ follows a cosine schedule from $\tau_0 = 0.996$ to $\tau_1 = 1.0$ over the course of training. The target encoder processes all $N$ patch tokens (no masking) so that its output representations capture full-image context. Representations for the target regions are then extracted by indexing into the output token sequence at the spatial positions identified by the saliency-based region selector.
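The update rule and its cosine momentum schedule are simple enough to sketch directly. Below is a plain-Python sketch on scalar parameters; the schedule shape is the standard cosine ramp, and the function names are ours, not the repository's:

```python
import math

def ema_momentum(step: int, total_steps: int,
                 tau0: float = 0.996, tau1: float = 1.0) -> float:
    """Cosine ramp of the EMA momentum from tau0 (start) to tau1 (end)."""
    progress = step / max(1, total_steps)
    return tau1 - (tau1 - tau0) * (math.cos(math.pi * progress) + 1) / 2

def ema_update(target_params: list, context_params: list, tau: float) -> None:
    """In-place EMA: xi <- tau * xi + (1 - tau) * theta (scalars for clarity)."""
    for i, (xi, theta) in enumerate(zip(target_params, context_params)):
        target_params[i] = tau * xi + (1 - tau) * theta
```

At step 0 the momentum equals $\tau_0 = 0.996$; at the final step it reaches $\tau_1 = 1.0$, at which point the target encoder stops moving.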

WHY. The EMA target encoder serves two purposes: (1) it provides slowly evolving, stable prediction targets that prevent training oscillations, and (2) the asymmetry between the gradient-updated context encoder and the EMA-updated target encoder, combined with the predictor's bottleneck architecture, helps the system avoid representational collapse (where all patches map to identical representations). This is a design inherited from BYOL and I-JEPA; DSeq-JEPA does not modify the EMA mechanism itself.

4.3 Sequential Predictor $g_\phi$

WHAT. The predictor is a narrow transformer that takes context encoder outputs and positional mask tokens as input and produces latent predictions for the target regions. In DSeq-JEPA, unlike I-JEPA's single-pass parallel predictor, the predictor is called $K$ times in sequence, once per target region in saliency-ranked order.

HOW. The predictor uses a smaller transformer (e.g., 12 blocks, embedding dimension $D_p = 384$, 12 heads). At sequential step $t$:

  1. The predictor input consists of: (a) context encoder outputs $\{h_i\}_{i \in \mathcal{C}}$ projected to $D_p$, (b) predicted representations from prior steps $\{\hat{s}_1, \dots, \hat{s}_{t-1}\}$, and (c) learnable mask tokens $\{m_j\}_{j \in \mathcal{R}_t}$ with positional embeddings for the target region $\mathcal{R}_t$.
  2. All tokens pass through the predictor's self-attention layers.
  3. Output tokens at the mask-token positions are extracted and projected back to dimension $D$ to produce $\hat{s}_t \in \mathbb{R}^{|\mathcal{R}_t| \times D}$.

Crucially, the predictor at step $t$ has access to the predictions from steps $1, \dots, t{-}1$ but not to the target encoder's ground-truth representations for those steps. This means the predictor must chain its own predictions forward, creating a dependency that incentivizes coherent, accumulative scene understanding.
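The growing-input mechanics can be illustrated with a toy NumPy sketch. The predictor here is a stand-in that only respects the interface (one output per mask token), and all sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D_p = 8                              # toy predictor width
ctx = rng.normal(size=(5, D_p))      # 5 projected context tokens
regions = [2, 3, 2]                  # |R_t| mask tokens per step (toy sizes)

def toy_predictor(tokens: np.ndarray, n_masks: int) -> np.ndarray:
    """Stand-in for g_phi: one output per mask token (mean of inputs,
    broadcast). Only the interface matters for this sketch."""
    return np.repeat(tokens.mean(axis=0, keepdims=True), n_masks, axis=0)

buffer = np.empty((0, D_p))          # predictions accumulated so far
input_lengths = []
for n_t in regions:
    masks = np.zeros((n_t, D_p))     # mask tokens + positional embeddings
    inp = np.concatenate([ctx, buffer, masks], axis=0)
    input_lengths.append(len(inp))
    s_hat = toy_predictor(inp, n_t)  # predictions for region t
    buffer = np.concatenate([buffer, s_hat], axis=0)  # condition later steps

# Input grows as prior predictions accumulate: 5+2, 5+2+3, 5+5+2
print(input_lengths)   # [7, 10, 12]
```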

WHY. The sequential design provides three benefits validated by ablations:

  • Permutation symmetry breaking. In I-JEPA, target blocks are processed as an unordered set. The predictor receives the same learning signal regardless of which block it attends to first internally. DSeq-JEPA imposes a strict ordering based on semantic importance, creating a richer learning objective.
  • Compositional reasoning. Later predictions are conditioned on earlier ones, encouraging the predictor to model inter-region relationships (e.g., predicting a bird's tail is easier after predicting its head).
  • Curriculum effect. The most discriminative (and often hardest to infer from context alone) regions are predicted first, while secondary regions benefit from accumulated predictions. Ablations show that reversing the order (least-to-most discriminative) degrades performance, confirming the importance of the saliency-based ordering.

4.4 Saliency-Based Masking Strategy

WHAT. DSeq-JEPA replaces I-JEPA's random block masking with an attention-derived, content-adaptive masking strategy. The encoder's own attention maps are aggregated into a per-patch saliency score, and patches are grouped into regions that are ranked by average saliency.

HOW. Given the context encoder's multi-head self-attention weights $A^{(l,h)} \in \mathbb{R}^{N \times N}$ at layer $l$ and head $h$, the saliency score for patch $i$ is computed as:

$$\sigma_i = \frac{1}{LH} \sum_{l=1}^{L} \sum_{h=1}^{H} A^{(l,h)}_{\text{[CLS]}, i}$$

where $A^{(l,h)}_{\text{[CLS]}, i}$ is the attention weight from the [CLS] token to patch $i$, and $L, H$ are the number of layers and heads, respectively. In the absence of a [CLS] token, the average attention received by each patch (column-wise mean of the attention matrix) can serve as an alternative saliency measure.

Patches are then grouped into contiguous rectangular regions (blocks) following I-JEPA's block generation procedure, and each block's saliency is the average $\sigma$ over its constituent patches. The top-$K$ blocks by saliency score are selected as targets, and the remaining patches form the context set $\mathcal{C}$.
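A minimal NumPy sketch of the two saliency variants described above (function names are ours; shapes assume attention weights stacked over layers and heads):

```python
import numpy as np

def cls_saliency(attn: np.ndarray) -> np.ndarray:
    """attn: (L, H, N+1, N+1) attention with the [CLS] token at index 0.
    Returns per-patch saliency sigma, shape (N,): the [CLS]-to-patch
    attention averaged over layers and heads."""
    return attn[:, :, 0, 1:].mean(axis=(0, 1))

def mean_received_saliency(attn: np.ndarray) -> np.ndarray:
    """Fallback without [CLS]: attn (L, H, N, N). Average attention each
    patch receives (column-wise mean), averaged over layers and heads."""
    return attn.mean(axis=(0, 1)).mean(axis=0)
```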

[Figure 2: masking comparison diagram, "Random (I-JEPA) vs. Discriminative (DSeq-JEPA)". Example saliency scores in the right panel: T1 σ = 0.91, T2 σ = 0.74, T3 σ = 0.58, T4 σ = 0.42; stroke thickness ∝ saliency score.]
Figure 2. Masking comparison. Left: I-JEPA samples target blocks randomly across the image and predicts them in parallel. Right: DSeq-JEPA selects blocks ranked by attention saliency $\sigma$ and predicts them sequentially from most (T1) to least (T4) discriminative. Blocks concentrate on the object region, and each prediction step conditions on prior predictions.

WHY. Ablation studies in the paper demonstrate the importance of saliency-based masking by comparing three masking strategies under the sequential prediction framework:

| Masking Strategy | ImageNet Lin. (%) | CUB Lin. (%) |
|---|---|---|
| Random blocks (I-JEPA baseline) | 75.5 | 58.2 |
| Random blocks + sequential prediction | 76.1 | 60.8 |
| Saliency blocks + parallel prediction | 76.4 | 62.1 |
| Saliency blocks + sequential prediction (DSeq-JEPA) | 77.3 | 65.7 |
Both components contribute, but the combination delivers superadditive gains, especially on fine-grained benchmarks. The saliency-based masking alone improves CUB by +3.9 points over random, while sequential prediction alone adds +2.6 points. Together, they add +7.5 points—confirming that the saliency ordering and the sequential conditioning are synergistic.

The paper also ablates the ordering direction and finds that predicting from least to most discriminative (reversed order) underperforms the most-to-least ordering by 1.8 points on ImageNet and 4.2 points on CUB, validating the curriculum hypothesis.

4.5 Loss Function

WHAT. DSeq-JEPA uses a smooth-$\ell_1$ (Huber) loss to compare predicted representations with target encoder representations, summed across all sequential prediction steps.

HOW. Let $K$ be the number of sequential target regions, $\mathcal{R}_t$ the set of patch indices in the $t$-th target region (ordered by decreasing saliency), $\hat{s}_{t,j} \in \mathbb{R}^D$ the predictor's output for patch $j$ in region $t$, and $s_{t,j} \in \mathbb{R}^D$ the target encoder's output for the same patch. The per-sample loss is:

$$\mathcal{L} = \sum_{t=1}^{K} \frac{1}{|\mathcal{R}_t|} \sum_{j \in \mathcal{R}_t} \text{SmoothL1}(\hat{s}_{t,j}, s_{t,j})$$

where the Smooth-$\ell_1$ loss is defined element-wise and then summed over the $D$ dimensions:

$$\text{SmoothL1}(\hat{s}, s) = \sum_{d=1}^{D} \begin{cases} \frac{1}{2\beta}(\hat{s}_d - s_d)^2 & \text{if } |\hat{s}_d - s_d| < \beta \\ |\hat{s}_d - s_d| - \frac{\beta}{2} & \text{otherwise} \end{cases}$$

with $\beta = 2.0$ following I-JEPA. The variables are:

| Symbol | Description | Typical Value |
|---|---|---|
| $K$ | Number of target regions (sequential steps) | 4 |
| $\mathcal{R}_t$ | Set of patch indices in the $t$-th target region | ~15–30 patches each |
| $\hat{s}_{t,j}$ | Predicted representation for patch $j$ at step $t$ | $\in \mathbb{R}^D$ |
| $s_{t,j}$ | Target encoder representation for patch $j$ at step $t$ | $\in \mathbb{R}^D$ |
| $D$ | Representation dimension | 1024 (ViT-L) |
| $\beta$ | Huber loss threshold | 2.0 |

Target representations $s_{t,j}$ are computed by the EMA target encoder from the full unmasked image, then indexed at positions in $\mathcal{R}_t$. Gradients from $\mathcal{L}$ flow through $\hat{s}_{t,j}$ to update both the predictor parameters $\phi$ and the context encoder parameters $\theta$; the target encoder parameters $\xi$ receive no gradients (stop-gradient) and are updated only via EMA.

WHY. The smooth-$\ell_1$ loss is inherited from I-JEPA and is preferred over MSE because it is less sensitive to outliers in representation space, providing a more stable training signal. The per-region averaging $\frac{1}{|\mathcal{R}_t|}$ normalizes for variable region sizes so that larger regions do not dominate the loss. The sum over sequential steps (rather than a weighted sum) treats all prediction steps equally; ablations in the paper show that introducing a decay weight (down-weighting later, easier predictions) does not improve performance, suggesting that all steps provide useful learning signal.
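The loss above can be sketched in NumPy (an illustrative reimplementation of the formulas, not the repository's code):

```python
import numpy as np

def smooth_l1(pred: np.ndarray, target: np.ndarray, beta: float = 2.0) -> float:
    """Smooth-L1 / Huber loss, summed over all elements:
    (1/(2*beta)) * d^2 when |d| < beta, else |d| - beta/2."""
    diff = np.abs(pred - target)
    per_elem = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return float(per_elem.sum())

def dseq_loss(preds: list, targets: list, beta: float = 2.0) -> float:
    """Per-sample loss: sum over sequential steps t, averaged over the
    |R_t| patches of each region. preds/targets: lists of (|R_t|, D)
    arrays, one pair per step."""
    return sum(smooth_l1(p, t, beta) / len(p) for p, t in zip(preds, targets))
```

With $\beta = 2$, a residual of 1 in every one of the $D$ dimensions falls in the quadratic regime ($0.25$ per dimension), while a residual of 5 falls in the linear regime ($5 - 1 = 4$ per dimension).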

4.6 Saliency-Ordering Module (Variant-Specific Component)

WHAT. The saliency-ordering module is the novel bridge between the encoder's attention maps and the predictor's sequential input construction. It is not a learned component but a deterministic procedure that converts continuous attention scores into a discrete, ordered sequence of target regions.

HOW. The procedure involves three sub-steps:

  1. Attention aggregation. Multi-head, multi-layer attention maps are averaged to produce a single saliency vector $\boldsymbol{\sigma} \in \mathbb{R}^N$ (one score per patch).
  2. Block proposal. Candidate target blocks are generated following I-JEPA's block sampling procedure (random aspect ratios within a specified range, random scale within a range). For each candidate block, its saliency score is the mean of $\sigma_i$ over its constituent patches.
  3. Top-$K$ selection and ordering. The $K$ blocks with highest saliency scores are selected (with overlap removal via non-maximum suppression to ensure spatial diversity), and they are sorted in descending order of saliency.

The remaining patches (not covered by any of the $K$ selected blocks) form the context set. Note that this reverses I-JEPA's masking logic: I-JEPA first samples target blocks and then the context is the complement, whereas DSeq-JEPA's context depends on which blocks are selected by saliency, meaning the context set is content-adaptive.

WHY. The saliency-ordering module is the key differentiator from I-JEPA. By making target selection content-adaptive, DSeq-JEPA ensures that the model's predictive capacity is allocated to the most informative image regions. The ordering then converts this spatial selection into a temporal curriculum. The module is parameter-free and adds negligible computation (attention maps are already computed).

5. Implementation Details

| Hyperparameter | ViT-L/16 | ViT-H/14 |
|---|---|---|
| Encoder layers | 24 | 32 |
| Encoder heads | 16 | 16 |
| Encoder dim $D$ | 1024 | 1280 |
| Patch size | 16 × 16 | 14 × 14 |
| Image resolution | 224 × 224 | 224 × 224 |
| Predictor layers | 12 | 12 |
| Predictor heads | 12 | 12 |
| Predictor dim $D_p$ | 384 | 384 |
| Number of target regions $K$ | 4 | 4 |
| Target block scale range | [0.15, 0.2] | [0.15, 0.2] |
| Target block aspect ratio | [0.75, 1.5] | [0.75, 1.5] |
| Optimizer | AdamW | AdamW |
| Base learning rate | 1.5e-4 | 1.5e-4 |
| LR schedule | Cosine decay | Cosine decay |
| Warmup epochs | 40 | 40 |
| Total epochs | 300 | 300 |
| Batch size | 2048 | 2048 |
| Weight decay | 0.05 | 0.05 |
| EMA schedule ($\tau$) | 0.996 → 1.0 (cosine) | 0.996 → 1.0 (cosine) |
| Smooth-L1 $\beta$ | 2.0 | 2.0 |
| GPUs | 16 × A100 (40GB) | 32 × A100 (80GB) |
| Saliency attention layers | All layers averaged | All layers averaged |
| NMS overlap threshold | 0.5 | 0.5 |

The DSeq-JEPA codebase (available at https://github.com/SkyShunsuke/DSeq-JEPA) is structured around the following key classes and modules:

# Key classes from the DSeq-JEPA repository
# src/models/vision_transformer.py — VisionTransformer (encoder backbone)
# src/models/predictor.py — SequentialPredictor (narrow transformer, called K times)
# src/masks/saliency.py — SaliencyMaskCollator (computes saliency, ranks regions)
# src/trainer.py — DSeqJEPATrainer (training loop with sequential prediction)
# src/utils/ema.py — EMA update utilities

The SaliencyMaskCollator extracts attention maps from the context encoder's forward pass via PyTorch forward hooks registered on the attention layers. The SequentialPredictor maintains an internal buffer of prior predictions that is extended at each step before the next forward pass.
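The forward-hook pattern for capturing attention is standard PyTorch. A toy sketch using `nn.MultiheadAttention` as a stand-in for the encoder's attention layers (the real collator hooks the ViT's own attention modules):

```python
import torch
import torch.nn as nn

# Stand-in attention layer; the hook pattern is what matters here.
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
captured = []

def grab_attention(module, inputs, output):
    # nn.MultiheadAttention's forward returns (attn_output, attn_weights)
    captured.append(output[1].detach())

handle = attn.register_forward_hook(grab_attention)

x = torch.randn(2, 10, 16)            # (batch, tokens, dim)
_ = attn(x, x, x, need_weights=True)  # weights averaged over heads by default
handle.remove()                        # unregister once maps are collected

print(captured[0].shape)               # torch.Size([2, 10, 10])
```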

6. Algorithm

Algorithm 1: DSeq-JEPA Training (One Epoch)
Input: Dataset $\mathcal{D}$, context encoder $f_\theta$, target encoder $f_\xi$, sequential predictor $g_\phi$, number of target regions $K$, EMA momentum schedule $\tau(\cdot)$, learning rate schedule $\eta(\cdot)$
Output: Updated parameters $\theta$, $\phi$, $\xi$
1 for each mini-batch $\{x_1, \dots, x_B\} \sim \mathcal{D}$ do
2 // Forward pass through context encoder (full image, to get attention)
3 $H_\text{full}, \{A^{(l,h)}\} \leftarrow f_\theta(x)$ // tokens + attention maps
4 // Compute saliency scores
5 $\sigma_i \leftarrow \frac{1}{LH} \sum_{l,h} A^{(l,h)}_{\text{[CLS]},i}$ for each patch $i$
6 // Generate candidate blocks, score by mean saliency, select top-K with NMS
7 $\{\mathcal{R}_1, \dots, \mathcal{R}_K\} \leftarrow \textsc{SaliencySelect}(\boldsymbol{\sigma}, K)$ // ordered: $\bar{\sigma}(\mathcal{R}_1) \geq \cdots \geq \bar{\sigma}(\mathcal{R}_K)$
8 $\mathcal{C} \leftarrow \{1, \dots, N\} \setminus \bigcup_{t=1}^{K} \mathcal{R}_t$ // context set
9 // Re-encode only context patches
10 $H_\text{ctx} \leftarrow f_\theta(x[\mathcal{C}])$ // $N_\text{ctx} \times D$
11 // Target encoder: full image (no masking, stop-gradient)
12 $S \leftarrow \text{sg}(f_\xi(x))$ // $N \times D$, detached
13 // Sequential prediction
14 $\mathcal{L} \leftarrow 0$; $\text{buffer} \leftarrow \emptyset$
15 for $t = 1, \dots, K$ do
16 $M_t \leftarrow \text{MaskTokens}(\mathcal{R}_t)$ // learnable mask tokens + positional emb for region $t$
17 $\text{input}_t \leftarrow \text{Concat}[\text{Proj}(H_\text{ctx}),\ \text{buffer},\ M_t]$
18 $\hat{S}_t \leftarrow g_\phi(\text{input}_t)[\mathcal{R}_t]$ // extract predictions at mask positions, $|\mathcal{R}_t| \times D$
19 $\mathcal{L} \leftarrow \mathcal{L} + \frac{1}{|\mathcal{R}_t|} \sum_{j \in \mathcal{R}_t} \text{SmoothL1}(\hat{S}_{t,j},\ S[\mathcal{R}_t]_j)$
20 $\text{buffer} \leftarrow \text{buffer} \cup \{\hat{S}_t \text{ with pos. emb. for } \mathcal{R}_t\}$ // append predictions
21 end for
22 // Backward pass and parameter update
23 $\theta \leftarrow \theta - \eta \cdot \nabla_\theta \mathcal{L}$; $\phi \leftarrow \phi - \eta \cdot \nabla_\phi \mathcal{L}$
24 // EMA update of target encoder
25 $\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta$
26 end for

Algorithm 2: Saliency-Based Region Selection
Input: Saliency vector $\boldsymbol{\sigma} \in \mathbb{R}^N$, number of targets $K$, block scale range $[s_\text{min}, s_\text{max}]$, aspect ratio range $[a_\text{min}, a_\text{max}]$, NMS threshold $\eta_\text{nms}$
Output: Ordered target regions $\{\mathcal{R}_1, \dots, \mathcal{R}_K\}$, context set $\mathcal{C}$
1 // Generate M candidate blocks (M >> K)
2 $\text{candidates} \leftarrow \emptyset$
3 for $m = 1, \dots, M$ do
4 Sample scale $s \sim \text{Uniform}[s_\text{min}, s_\text{max}]$, aspect $a \sim \text{Uniform}[a_\text{min}, a_\text{max}]$
5 Compute block height $h = \lfloor\sqrt{s \cdot N / a}\rfloor$, width $w = \lfloor s \cdot N / h \rfloor$
6 Sample random position $(r, c)$ on the patch grid
7 $\mathcal{B}_m \leftarrow$ set of patch indices in the $h \times w$ block at $(r, c)$
8 $\bar{\sigma}_m \leftarrow \frac{1}{|\mathcal{B}_m|} \sum_{i \in \mathcal{B}_m} \sigma_i$ // mean saliency of block
9 $\text{candidates} \leftarrow \text{candidates} \cup \{(\mathcal{B}_m, \bar{\sigma}_m)\}$
10 end for
11 // Sort candidates by saliency (descending)
12 Sort candidates by $\bar{\sigma}_m$ in descending order
13 // Greedy NMS selection
14 $\text{selected} \leftarrow \emptyset$
15 for each $(\mathcal{B}_m, \bar{\sigma}_m)$ in sorted order do
16 if $\text{IoU}(\mathcal{B}_m, \mathcal{B}') < \eta_\text{nms}$ for all $\mathcal{B}' \in \text{selected}$ then
17 $\text{selected} \leftarrow \text{selected} \cup \{\mathcal{B}_m\}$
18 if $|\text{selected}| = K$ then break
19 end if
20 end for
21 $\{\mathcal{R}_1, \dots, \mathcal{R}_K\} \leftarrow \text{selected}$ (already sorted by saliency)
22 $\mathcal{C} \leftarrow \{1, \dots, N\} \setminus \bigcup_{t=1}^{K} \mathcal{R}_t$
23 return $\{\mathcal{R}_1, \dots, \mathcal{R}_K\}$, $\mathcal{C}$
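Algorithm 2 can be sketched in plain Python (illustrative: the candidate count, helper names, and flat-index block layout are our assumptions, not the repository's implementation):

```python
import random

def saliency_select(sigma, grid, K=4, n_cand=32, scale=(0.15, 0.2),
                    aspect=(0.75, 1.5), nms_thr=0.5, seed=0):
    """Sketch of Algorithm 2. sigma: per-patch saliency, length grid*grid.
    Returns (ordered target regions, context set of patch indices)."""
    rng = random.Random(seed)
    N = grid * grid
    cands = []
    for _ in range(n_cand):
        s = rng.uniform(*scale)                  # block scale
        a = rng.uniform(*aspect)                 # block aspect ratio
        h = max(1, int((s * N / a) ** 0.5))      # floor(sqrt(s*N/a))
        w = max(1, int(s * N / h))               # floor(s*N/h)
        r = rng.randrange(grid - h + 1)          # random grid position
        c = rng.randrange(grid - w + 1)
        block = {i * grid + j for i in range(r, r + h) for j in range(c, c + w)}
        score = sum(sigma[i] for i in block) / len(block)  # mean saliency
        cands.append((block, score))
    cands.sort(key=lambda bc: bc[1], reverse=True)         # descending saliency
    selected = []
    for block, score in cands:                             # greedy NMS
        if all(len(block & b) / len(block | b) < nms_thr for b in selected):
            selected.append(block)
            if len(selected) == K:
                break
    context = set(range(N)) - set().union(*selected) if selected else set(range(N))
    return selected, context
```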

7. Training

Step-by-Step: One Training Iteration

  1. Sample mini-batch. Draw $B$ images from ImageNet-1K (no labels used). Apply minimal augmentations (random resized crop to 224×224, horizontal flip). No color jitter, no multi-crop—following the I-JEPA principle of avoiding hand-crafted augmentations.
  2. Initial forward pass for saliency. Pass the full (un-masked) batch through the context encoder $f_\theta$ with attention recording enabled. Extract multi-head self-attention maps $\{A^{(l,h)}\}$ from all layers and compute per-patch saliency scores $\boldsymbol{\sigma}$ by averaging [CLS]-to-patch attention across all heads and layers. Shape: $B \times N$.
  3. Saliency-based region selection. For each image independently, generate candidate target blocks (random positions, constrained scale and aspect ratio), score each by mean saliency, and select the top-$K = 4$ via greedy NMS. This yields ordered target regions $\{\mathcal{R}_1, \dots, \mathcal{R}_4\}$ and the context set $\mathcal{C}$ per image. Note that $\mathcal{C}$ and $\mathcal{R}_t$ vary across images in the batch.
  4. Context encoding. Re-encode only the context patches $x[\mathcal{C}]$ through $f_\theta$ (this time without recording attention, for efficiency). Output: $H_\text{ctx} \in \mathbb{R}^{B \times N_\text{ctx} \times D}$ where $N_\text{ctx} = |\mathcal{C}|$ varies per image (padded within the batch).
  5. Target encoding. Pass the full images through the EMA target encoder $f_\xi$ (no masking). Apply stop-gradient. Output: $S \in \mathbb{R}^{B \times N \times D}$.
  6. Sequential prediction (loop over $t = 1, \dots, K$).
    • Construct mask tokens with positional embeddings for region $\mathcal{R}_t$.
    • Project context tokens to predictor dimension $D_p$.
    • Concatenate: projected context + buffer of prior predictions + current mask tokens.
    • Forward through predictor $g_\phi$ (12-layer transformer).
    • Extract output at mask-token positions, project back to $D$: $\hat{S}_t \in \mathbb{R}^{B \times |\mathcal{R}_t| \times D}$.
    • Compute step loss: $\mathcal{L}_t = \frac{1}{|\mathcal{R}_t|} \sum_j \text{SmoothL1}(\hat{S}_{t,j}, S[\mathcal{R}_t]_j)$.
    • Append $\hat{S}_t$ (with positional embeddings) to the prediction buffer.
  7. Accumulate loss. $\mathcal{L} = \sum_{t=1}^{K} \mathcal{L}_t$. Average over the batch.
  8. Backward pass. Compute gradients $\nabla_\theta \mathcal{L}$ and $\nabla_\phi \mathcal{L}$. Note that gradients flow through all $K$ sequential prediction steps (the predictions from earlier steps are used in later steps, creating a computational graph that spans the entire sequence). This is analogous to backpropagation through time in RNNs, though here the number of steps $K = 4$ is small.
  9. Parameter update. Apply AdamW with the current learning rate $\eta$ (cosine schedule with 40-epoch warmup) and weight decay 0.05.
  10. EMA update. $\xi \leftarrow \tau \cdot \xi + (1 - \tau) \cdot \theta$ where $\tau$ follows a cosine schedule from 0.996 to 1.0.

Training Architecture Diagram (Detailed)

[Figure 3: detailed one-iteration data-flow diagram. Shape legend: B = 2048, N = 196, N_ctx ≈ 120, D = 1024 (ViT-L), D_p = 384, K = 4, |R_t| ≈ 15–30 patches.]
Figure 3. Detailed training data flow for one DSeq-JEPA iteration, mirroring the numbered procedure in the text. Solid borders indicate trainable components; dashed borders indicate EMA-updated/frozen components; dashed lines show gradient flow. The sequential predictor executes $K$ forward passes, with each step's output feeding into the next step's input.

Training Notes

Two forward passes through the encoder. DSeq-JEPA requires two passes through $f_\theta$ per iteration: one full-image pass to compute attention-based saliency (step 2 of the procedure above), and one context-only pass for the actual context encoding (step 4). The first pass is necessary because the saliency scores determine which patches are context vs. target, so the masking cannot be decided before the encoder has seen the image. The overhead is modest: the first pass processes all $N = 196$ tokens, while the second processes only $N_\text{ctx} \approx 120$ tokens (roughly 76 tokens are removed as targets). The paper reports approximately 1.4× the per-iteration cost of I-JEPA.

Gradient flow through sequential steps. Because predictions from earlier steps are concatenated into the input for later steps, gradients at step $K$ flow back through the predictor's computation graph at steps $K{-}1, K{-}2, \dots, 1$. The computational graph has depth proportional to $K$ (the number of sequential steps), which is small ($K = 4$), so gradient vanishing is not a practical concern. No gradient clipping beyond AdamW's implicit norm control is used.
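The sequential structure described above can be sketched as a forward pass in NumPy. The toy linear `predictor`, the Smooth-L1 threshold β = 1, and all sizes are illustrative assumptions; the point is the loop shape, in which each step's prediction is appended to the next step's input:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_ctx, K, n_tgt = 8, 5, 4, 3        # toy sizes: embed dim, context tokens, steps, tokens/region

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss, elementwise then averaged."""
    diff = np.abs(pred - target)
    return np.mean(np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta))

W = rng.normal(scale=0.1, size=(D, D))  # stand-in for the predictor g_phi's weights

def predictor(tokens, n_out):
    """Toy 'predictor': pool the input sequence, emit n_out predicted tokens."""
    pooled = tokens.mean(axis=0) @ W
    return np.tile(pooled, (n_out, 1))

ctx = rng.normal(size=(N_ctx, D))       # context encoder output
targets = [rng.normal(size=(n_tgt, D)) for _ in range(K)]  # f_xi targets (stop-grad)

seq_input, total_loss = ctx, 0.0
for t in range(K):                      # step t conditions on context + predictions from steps < t
    pred_t = predictor(seq_input, n_tgt)
    total_loss += smooth_l1(pred_t, targets[t])
    seq_input = np.concatenate([seq_input, pred_t], axis=0)

print(seq_input.shape)                  # input grew by K * n_tgt predicted tokens
```

Because `pred_t` re-enters the next call to `predictor`, an autodiff framework would backpropagate the step-$K$ loss through all earlier steps, exactly as described in the text.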

8. Inference

At inference time, DSeq-JEPA discards the predictor, the target encoder, and the saliency-masking procedure entirely. Only the trained context encoder $f_\theta$ is retained as a general-purpose feature extractor. This is identical to how I-JEPA is deployed: the pretrained ViT encoder processes the full image without any masking, and the output token representations are used for downstream tasks.

Downstream Protocols

Linear probing. Freeze the encoder $f_\theta$. Pass each image through the encoder to obtain the [CLS] token representation (or average-pooled patch tokens) of dimension $D$. Train a single linear layer $W \in \mathbb{R}^{C \times D}$ (where $C$ is the number of classes) on top of the frozen features using cross-entropy loss. This protocol evaluates representation quality without any adaptation of the encoder. Settings: SGD with momentum 0.9, batch size 256, 100 epochs.
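The probing protocol can be sketched as plain softmax regression on frozen features. In this NumPy sketch, synthetic class-dependent Gaussians stand in for encoder outputs, and plain gradient descent stands in for the SGD-with-momentum recipe above:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, n = 16, 3, 300                      # feature dim, classes, samples

# Stand-in for frozen encoder outputs: class-dependent Gaussian features.
centers = rng.normal(size=(C, D))
labels = rng.integers(0, C, size=n)
feats = centers[labels] + 0.3 * rng.normal(size=(n, D))  # encoder frozen: feats never change

W = np.zeros((C, D))                      # the only trainable parameters
lr = 0.5
for _ in range(200):                      # gradient descent on softmax cross-entropy
    logits = feats @ W.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(C)[labels]
    grad = (probs - onehot).T @ feats / n
    W -= lr * grad

acc = np.mean((feats @ W.T).argmax(axis=1) == labels)
print(f"probe accuracy: {acc:.2f}")       # high on this well-separated toy data
```

The key property mirrored here is that only $W$ receives updates; representation quality is measured by how linearly separable the frozen features already are.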

Fine-tuning. Initialize a classifier from the pretrained encoder, adding a linear classification head. Fine-tune all parameters end-to-end with a smaller learning rate for the encoder and a larger one for the head. Settings: AdamW, layer-wise learning rate decay 0.65, batch size 1024, 100 epochs, data augmentations (RandAugment, Mixup, CutMix).
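Layer-wise learning-rate decay assigns each transformer block a rate that shrinks geometrically with distance from the head. A minimal sketch of the standard convention (the exact grouping of embeddings and blocks is an assumption; the 0.65 factor is from the settings above):

```python
def layerwise_lrs(base_lr, num_layers, decay=0.65):
    """Rate for depth l (0 = patch embedding, num_layers = top block / head):
    lr_l = base_lr * decay ** (num_layers - l)."""
    return [base_lr * decay ** (num_layers - l) for l in range(num_layers + 1)]

lrs = layerwise_lrs(base_lr=1e-3, num_layers=24)   # ViT-L/16 has 24 blocks
print(f"embedding lr: {lrs[0]:.2e}, top-block lr: {lrs[24]:.2e}")
# The top block trains at the full base rate; the embedding barely moves,
# preserving pretrained low-level features while adapting the head.
```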

Detection / segmentation. Use the pretrained encoder as a backbone within Mask R-CNN (for MS-COCO detection and instance segmentation) or UPerNet (for ADE20K semantic segmentation). Standard fine-tuning protocols from ViTDet and BEiT are followed.

Inference Pipeline Diagram

[Figure 4: inference pipeline. A 224×224×3 test image is patchified with positional embeddings (no masking) and passed through the pretrained f_θ (ViT-L/16 or ViT-H/14), producing all N = 196 patch tokens (196×D); the encoder is frozen for probing or fine-tuned. Tokens are pooled ([CLS] or average) into a linear probe / fine-tuning head W ∈ R^{C×D}, or the encoder serves as backbone for Mask R-CNN (MS-COCO det/seg), UPerNet (ADE20K), or fine-grained classification (CUB / iNat / Cars). Discarded at inference: the target encoder f_ξ (EMA copy), the sequential predictor g_φ, and the saliency computation/masking.]
Figure 4. DSeq-JEPA inference pipeline. Only the pretrained context encoder is retained. The full image (no masking) is processed, and the resulting representations are used for downstream classification, detection, or segmentation. The predictor, target encoder, and saliency mechanism are discarded.

A key property of DSeq-JEPA's inference pipeline is that it is identical in architecture and computational cost to I-JEPA's inference. All the added complexity (saliency computation, sequential prediction) exists only during pretraining. At deployment, the encoder is a standard ViT—compatible with any existing ViT-based downstream pipeline.

9. Results & Benchmarks

9.1 ImageNet-1K Classification

| Method | Backbone | Epochs | Linear Probe (%) | Fine-Tune (%) |
|---|---|---|---|---|
| MAE | ViT-L/16 | 1600 | 75.8 | 85.9 |
| data2vec 2.0 | ViT-L/16 | 800 | 76.2 | 86.4 |
| I-JEPA | ViT-L/16 | 300 | 75.5 | 84.2 |
| DSeq-JEPA | ViT-L/16 | 300 | 77.3 | 85.1 |
| I-JEPA | ViT-H/14 | 300 | 77.3 | 85.5 |
| DSeq-JEPA | ViT-H/14 | 300 | 78.8 | 86.3 |

DSeq-JEPA improves I-JEPA by +1.8 points on linear probing with ViT-L/16 at identical training cost (300 epochs). With ViT-H/14, the gain is +1.5 points on linear probing and +0.8 on fine-tuning. Notably, DSeq-JEPA with ViT-L/16 matches I-JEPA with ViT-H/14 on linear probing—achieving comparable performance with a significantly smaller model.

9.2 Fine-Grained Recognition

| Method | Backbone | CUB-200 (%) | iNat2021 (%) | Stanford Cars (%) |
|---|---|---|---|---|
| MAE | ViT-L/16 | 62.8 | 63.5 | 67.2 |
| I-JEPA | ViT-L/16 | 58.2 | 61.8 | 64.5 |
| DSeq-JEPA | ViT-L/16 | 65.7 | 67.4 | 71.3 |
| I-JEPA | ViT-H/14 | 63.4 | 65.9 | 69.1 |
| DSeq-JEPA | ViT-H/14 | 69.2 | 70.8 | 74.6 |

The gains on fine-grained benchmarks are substantially larger than on ImageNet. On CUB-200 with ViT-L/16, DSeq-JEPA improves over I-JEPA by +7.5 points; on Stanford Cars by +6.8 points. This validates the paper's central hypothesis: saliency-based target selection and sequential prediction are especially beneficial when downstream tasks require discriminating subtle visual cues.

9.3 Dense Prediction

| Method | Backbone | COCO AP (box) | COCO AP (mask) | ADE20K mIoU |
|---|---|---|---|---|
| MAE | ViT-L/16 | 53.3 | 47.2 | 53.6 |
| I-JEPA | ViT-L/16 | 52.1 | 46.3 | 52.8 |
| DSeq-JEPA | ViT-L/16 | 54.2 | 48.0 | 54.5 |
| I-JEPA | ViT-H/14 | 54.8 | 48.5 | 55.2 |
| DSeq-JEPA | ViT-H/14 | 56.1 | 49.8 | 56.7 |

On MS-COCO detection and segmentation, DSeq-JEPA with ViT-L/16 surpasses I-JEPA by +2.1 box AP and +1.7 mask AP. On ADE20K semantic segmentation, the gain is +1.7 mIoU. These improvements suggest that the saliency-driven training signal produces features that are better at localizing objects and parsing scene structure, not just classifying images.

9.4 Ablation Studies

Effect of Number of Sequential Steps $K$

| $K$ (steps) | IN-1K Lin. (%) | CUB Lin. (%) | Wall-clock overhead |
|---|---|---|---|
| 1 (single target) | 76.0 | 61.5 | 1.1× |
| 2 | 76.8 | 63.9 | 1.2× |
| 4 (default) | 77.3 | 65.7 | 1.4× |
| 6 | 77.4 | 65.9 | 1.6× |
| 8 | 77.2 | 65.4 | 1.9× |

Performance plateaus around $K = 4$–$6$ and slightly degrades at $K = 8$, likely because later steps predict increasingly uninformative regions (approaching the random-masking regime). The default $K = 4$ balances accuracy and cost.

Ordering Direction

| Order | IN-1K Lin. (%) | CUB Lin. (%) |
|---|---|---|
| Most → least discriminative (default) | 77.3 | 65.7 |
| Least → most discriminative (reversed) | 75.5 | 61.5 |
| Random order (saliency targets, random sequence) | 76.6 | 63.2 |

Reversing the order nearly eliminates the gains of DSeq-JEPA, reducing it to I-JEPA-level performance. Random ordering of saliency-selected targets retains some benefit (from better target selection) but loses the curriculum effect. This is strong evidence that the most-to-least ordering is a key ingredient, not just the saliency selection.

Saliency Source

| Saliency source | IN-1K Lin. (%) | CUB Lin. (%) |
|---|---|---|
| [CLS] attention, all layers (default) | 77.3 | 65.7 |
| [CLS] attention, last layer only | 76.9 | 64.8 |
| Column-wise mean attention, all layers | 77.0 | 65.1 |
| Gradient-based saliency (GradCAM-style) | 76.7 | 64.3 |

Averaging [CLS] attention across all layers and heads provides the best saliency signal. Last-layer-only attention is slightly worse, possibly because early layers capture low-level patterns that help identify truly discriminative regions when combined with later semantic attention. Gradient-based alternatives work but add computation and perform marginally worse.
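The default saliency source can be sketched directly. In this NumPy sketch, random softmax-normalized tensors stand in for the recorded attention maps, and the layers × heads × (N+1) × (N+1) shape convention (with the [CLS] token at index 0) is an assumption about the layout:

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, N = 24, 16, 196                     # layers, heads, patch tokens (ViT-L/16 at 224x224)

# Stand-ins for recorded attention maps; row 0 of each map is the [CLS] query.
attn = rng.random((L, H, N + 1, N + 1))
attn /= attn.sum(axis=-1, keepdims=True)  # rows normalized like softmax outputs

# Saliency: [CLS]-to-patch attention, averaged over all layers and heads.
cls_to_patch = attn[:, :, 0, 1:]          # (L, H, N): drop the CLS->CLS entry
saliency = cls_to_patch.mean(axis=(0, 1)) # (N,): one score per patch

order = np.argsort(saliency)[::-1]        # most -> least discriminative patch order
print(saliency.shape, order[:5])
```

The last-layer-only variant from the table corresponds to `attn[-1, :, 0, 1:].mean(axis=0)`; target regions are then grown around the top-ranked patches (with NMS to avoid overlap) in descending-saliency order.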

10. Connection to the JEPA Family

Lineage

DSeq-JEPA is a direct descendant of I-JEPA (Assran et al., 2023). It inherits the core architecture—context encoder, EMA target encoder, narrow predictor, smooth-$\ell_1$ loss in latent space—and modifies two specific aspects: the masking strategy and the prediction mode. The broader lineage traces through:

  • JEPA (LeCun, 2022): The conceptual framework proposing latent-space prediction as an alternative to generative and contrastive SSL.
  • I-JEPA (Assran et al., 2023): The first concrete image instantiation. Random multi-block masking, parallel prediction, ViT backbone.
  • DSeq-JEPA (He et al., 2026): Replaces random masking with attention-derived saliency masking; replaces parallel prediction with sequential autoregressive prediction ordered by discriminative importance.

DSeq-JEPA also connects to broader themes in the JEPA family:

  • V-JEPA / V-JEPA 2: Extend I-JEPA to video with spatiotemporal masking. DSeq-JEPA's saliency-based target selection could be adapted to video by computing spatiotemporal attention saliency, prioritizing prediction of the most informative space-time regions.
  • S-JEPA: Adds DINO-style self-distillation on top of the JEPA objective. DSeq-JEPA takes a different approach to improving target quality—through smarter selection rather than additional loss terms.
  • T-JEPA: Targets trajectory-level representations. While DSeq-JEPA operates on single images, its sequential prediction strategy shares the spirit of temporal progression.

Key Novelty: Content-Adaptive, Ordered Prediction

DSeq-JEPA's primary contribution to the JEPA family is demonstrating that what you predict and in what order matters as much as how you predict. Prior JEPA variants focused on the encoder architecture (ViT, hierarchical transformers), the loss function (smooth-$\ell_1$, contrastive, self-distillation), or the domain (images, video, audio, point clouds). DSeq-JEPA is the first to modify the prediction curriculum—introducing a semantically meaningful ordering over targets that mirrors how biological visual systems process scenes. This insight is architecture- and domain-agnostic: any JEPA variant that predicts multiple target blocks could, in principle, adopt saliency-based ordering and sequential prediction to improve the training signal.

Influence and Implications

DSeq-JEPA's results suggest several directions for the JEPA family:

  1. Adaptive masking as a general principle. The consistent gains from saliency-based target selection argue against uniform random masking across the board. Future JEPA variants for video, audio, and 3D may benefit from analogous content-adaptive masking strategies.
  2. Sequential vs. parallel prediction. The sequential approach adds a small overhead but significantly improves fine-grained recognition. This trade-off may be especially favorable for domains where part-level reasoning is critical (medical imaging, wildlife identification, industrial inspection).
  3. Attention maps as free supervision. The encoder's own attention maps, which are already computed during the forward pass, can serve as a form of self-supervision for guiding the training curriculum. This is a computationally cheap source of semantic signal that has been underexploited in prior JEPA work.

11. Summary

Key Takeaway. DSeq-JEPA demonstrates that replacing I-JEPA's spatially random, order-agnostic target prediction with a saliency-ranked sequential prediction strategy yields substantial improvements in learned representation quality—particularly for fine-grained recognition tasks where attending to discriminative image regions is critical.

Main Contributions.
  1. Attention-based saliency masking: Target blocks are selected by their discriminative importance as measured by the encoder's own attention maps, concentrating the predictive objective on semantically rich regions.
  2. Sequential next-region prediction: Targets are predicted one by one in descending order of saliency, with each step conditioned on prior predictions. This breaks permutation symmetry and creates a curriculum from primary to secondary visual cues.
  3. Strong empirical results: +1.8 points on ImageNet-1K linear probing (ViT-L/16), +7.5 points on CUB-200, +5.6 points on iNaturalist 2021, and +6.8 points on Stanford Cars over I-JEPA—all at identical training epochs. Dense prediction tasks (COCO, ADE20K) also improve.
  4. Zero inference overhead: All added complexity exists only during pretraining. The deployed encoder is a standard ViT, identical to I-JEPA's deployment.

The work establishes that the ordering and selection of prediction targets are first-class design decisions in the JEPA framework—not merely implementation details—and that content-adaptive, curriculum-style prediction is a simple yet powerful way to improve self-supervised visual representations.

12. References

  1. He, S., Sakai, T., Chandhok, A., Beery, S., Yuan, J., Padoy, N., Hasegawa, Y., & Sigal, L. (2026). DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture. arXiv preprint arXiv:2511.17354.
  2. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
  3. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. Technical report, Meta AI.
  4. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
  5. Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., & Auli, M. (2022). data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language. ICML 2022.
  6. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020.
  7. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021.
  8. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
  9. Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001.
  10. Van Horn, G., Cole, E., Beery, S., Wilber, K., Belongie, S., & Mac Aodha, O. (2021). Benchmarking Representation Learning for Natural World Image Collections. CVPR 2021.
  11. Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D Object Representations for Fine-Grained Categorization. ICCV Workshops 2013.
  12. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. ECCV 2014.
  13. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene Parsing through ADE20K Dataset. CVPR 2017.
  14. Li, Y., Mao, H., Girshick, R., & He, K. (2022). Exploring Plain Vision Transformer Backbones for Object Detection. ECCV 2022.
  15. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified Perceptual Parsing for Scene Understanding. ECCV 2018.
  16. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. ICCV 2017.
  17. Bardes, A., Ponce, J., & LeCun, Y. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ECCV 2024.