JEP-KD — World Modeling

At a glance

Problem: Lip reading is hard because the visual channel is ambiguous (visemes), while an audio model sees a far richer signal; transferring audio knowledge into a video-only recognizer is non-trivial.

Key idea: Insert a JEPA-style latent-prediction objective into knowledge distillation: the video encoder predicts the audio teacher's latent features.

Modality: Cross-modal audio-visual speech (lip video, audio teacher).

Target / masking: The cross-modal gap itself: predict audio (teacher) latents from video (student) over aligned speech segments.

Builds on: JEPA latent prediction; knowledge distillation for speech recognition.

Used for: Visual speech recognition / lip reading.

Motivation

Visual speech recognition — lip reading — is fundamentally under-determined: many phonemes share the same lip shape (the viseme problem), so the pixel stream is inherently ambiguous. An audio model, by contrast, observes a far richer and more discriminative signal. The natural idea is to transfer the audio model's knowledge into a video-only recognizer, but standard logit distillation copies only the teacher's outputs, not the structured intermediate representation that makes audio phonetically discriminative. JEP-KD aims to close the audio-visual performance gap by transferring that latent structure.

How it works

Figure: canonical JEPA schematic, adapted to lip video + audio. The input is split into a visible context (the lip video) and hidden targets (frame-level, cross-modal audio latents). The context encoder $f_\theta$ embeds what is visible; the target encoder $\bar f_\theta$ (an EMA copy in canonical JEPA; here the pretrained audio teacher, with gradients stopped) embeds the targets; the predictor $g_\phi$ maps context embeddings to the target embeddings; training minimises the latent distance. Task and distillation losses run alongside latent prediction (hybrid objective).

JEP-KD embeds a JEPA-style joint-embedding predictive objective inside a generalized knowledge-distillation framework.

The context is the visual (lip) stream, encoded by a student video encoder $f_\theta$.
A predictor $g_\phi$ maps these visual embeddings toward the latent representations produced by a pretrained audio teacher encoder, which acts as the target encoder.
Over aligned speech segments, the student learns to predict the audio model's semantic features rather than merely match output logits.

The effective masking/prediction unit is the cross-modal gap itself: from video alone, predict the corresponding audio latents. This forces the visual encoder to absorb phonetic structure that pixels under-determine.
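The data flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the encoders are stand-in random linear maps, and the names `rand_matrix`, `linear`, `W_student`, `W_pred`, `W_teacher`, `predict_audio_latents`, `teacher_latents` and all dimensions are hypothetical, chosen only to make the roles of student, predictor, and teacher visible.

```python
# Toy sketch of the JEP-KD data flow.
# All shapes, weights, and helper names are illustrative assumptions.
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Random weight matrix (list of rows); a stand-in for a real network."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def linear(W, x):
    """Apply weight matrix W (out_dim x in_dim) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

VIDEO_DIM, AUDIO_DIM, STUDENT_DIM, LATENT_DIM, T = 6, 8, 5, 4, 3

W_student = rand_matrix(STUDENT_DIM, VIDEO_DIM)   # f_theta: trainable video encoder
W_pred = rand_matrix(LATENT_DIM, STUDENT_DIM)     # g_phi: trainable predictor
W_teacher = rand_matrix(LATENT_DIM, AUDIO_DIM)    # audio teacher: pretrained, not updated

# One aligned segment: T lip-video frames and the T audio frames they correspond to.
video = [[random.random() for _ in range(VIDEO_DIM)] for _ in range(T)]
audio = [[random.random() for _ in range(AUDIO_DIM)] for _ in range(T)]

def predict_audio_latents(video_frames):
    """Student path: video -> f_theta -> g_phi -> predicted audio latents."""
    return [linear(W_pred, linear(W_student, v)) for v in video_frames]

def teacher_latents(audio_frames):
    """Teacher path: audio -> latents, treated as fixed targets (stop-gradient)."""
    return [linear(W_teacher, a) for a in audio_frames]

predicted = predict_audio_latents(video)   # uses video only
targets = teacher_latents(audio)           # needed at training time only
```

Note that `predict_audio_latents` never touches the audio: the teacher path is needed only to produce training targets, which is why inference can remain video-only.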

The objective

For aligned audio-visual segments, the latent-prediction term minimizes the distance between predicted visual-derived embeddings and audio targets:

$$\mathcal{L}_{\text{JEP}} = \big\lVert\, g_\phi\big(f_\theta(v)\big) - \operatorname{sg}\big[\bar f^{\text{audio}}(a)\big]\,\big\rVert_2^2,$$

where $v$ is the lip video, $a$ the aligned audio, $\bar f^{\text{audio}}$ the (EMA / frozen) teacher, and $\operatorname{sg}$ the stop-gradient. This latent objective is combined with the task and standard distillation losses, so the video encoder learns to predict the audio model's semantic features in addition to producing correct outputs.
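As a rough sketch of how such a combined objective might be assembled, the snippet below computes the latent term alongside a task loss and a logit-distillation loss. Several details are assumptions rather than facts from the source: averaging the squared distance over frames, using cross-entropy as the task loss, using KL divergence on softmax outputs as the "standard" distillation loss, and the weights `w_kd` and `w_jep`. All function names are hypothetical.

```python
# Sketch of a combined objective; loss forms and weights are assumptions.
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def jep_loss(predicted, targets):
    """Squared L2 distance between predicted and teacher latents,
    averaged over frames (the averaging is an assumption)."""
    per_frame = [
        sum((p - t) ** 2 for p, t in zip(pred_f, targ_f))
        for pred_f, targ_f in zip(predicted, targets)
    ]
    return sum(per_frame) / len(per_frame)

def cross_entropy(student_logits, label):
    """Task loss (assumed form): negative log-probability of the true class."""
    return -math.log(softmax(student_logits)[label])

def logit_kd(student_logits, teacher_logits):
    """Logit distillation (assumed form): KL(teacher || student)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def total_loss(predicted, targets, student_logits, teacher_logits, label,
               w_kd=1.0, w_jep=1.0):
    """Task loss plus weighted logit-distillation and latent-prediction terms."""
    return (cross_entropy(student_logits, label)
            + w_kd * logit_kd(student_logits, teacher_logits)
            + w_jep * jep_loss(predicted, targets))

# Two frames with per-frame squared distances 1.0 and 4.0.
example = jep_loss([[1.0, 0.0], [0.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]])  # 2.5
```

The point of the separation is that `logit_kd` can be zero while `jep_loss` is not: a student may match the teacher's output distribution and still hold latents that differ from the teacher's, which is the gap the latent term targets.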

Key results & what's novel

JEP-KD repurposes the JEPA principle from self-supervised pretraining to cross-modal supervised distillation. By treating audio-to-video transfer as latent prediction, the visual encoder absorbs phonetically discriminative structure that is under-determined in pixels — information that pure logit matching does not convey. This narrows the audio-visual performance gap and improves visual speech recognition. The broader novelty is methodological: it demonstrates that latent prediction is a general mechanism for transferring knowledge across modalities, not only an objective for unsupervised representation learning.

Strengths & limitations

+ Transfers structured intermediate features, not just logits, narrowing the audio-visual gap.
+ Reuses a pretrained audio teacher to supervise a video-only student.
+ Shows latent prediction generalizes to cross-modal supervised distillation.
− Requires aligned audio-visual training data and a strong audio teacher.
− The viseme ambiguity is reduced, not eliminated; some phonemes remain visually indistinguishable.
− Inference is video-only, but training depends on the teacher's quality and alignment accuracy.

Connections & references

Builds on: I-JEPA
