Motivation
Visual speech recognition (lip reading) is fundamentally under-determined: many phonemes share the same lip shape (the viseme problem; for example, /p/, /b/, and /m/ look nearly identical on the lips), so the pixel stream is inherently ambiguous. An audio model, by contrast, observes a far richer and more discriminative signal. The natural idea is to transfer the audio model's knowledge into a video-only recognizer, but standard logit distillation copies only the teacher's outputs, not the structured intermediate representation that makes audio phonetically discriminative. JEP-KD aims to close the audio-visual performance gap by transferring that latent structure.
How it works
JEP-KD embeds a JEPA-style joint-embedding predictive objective inside a generalized knowledge-distillation framework.
- The context is the visual (lip) stream, encoded by a student video encoder $f_\theta$.
- A predictor $g_\phi$ maps these visual embeddings toward the latent representations produced by a pretrained audio teacher encoder, which acts as the target encoder.
- Over aligned speech segments, the student learns to predict the audio model's semantic features rather than merely match output logits.
In JEPA terms, the region to be predicted is the missing modality itself: from video alone, the model predicts the corresponding audio latents. This forces the visual encoder to absorb phonetic structure that pixels under-determine.
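The data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the student encoder, predictor, and audio teacher are replaced by random linear maps, video and audio are assumed to be already aligned to the same number of frames, and every name and dimension (`W_student`, `W_predictor`, `W_teacher`, `D_LATENT`, and so on) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: frames per segment, per-frame feature widths, latent width.
T, D_VIDEO, D_AUDIO, D_LATENT = 8, 32, 20, 16

# Linear stand-ins for the real networks (assumption for illustration only).
W_student = rng.normal(scale=0.1, size=(D_VIDEO, D_LATENT))     # f_theta, trainable
W_predictor = rng.normal(scale=0.1, size=(D_LATENT, D_LATENT))  # g_phi, trainable
W_teacher = rng.normal(scale=0.1, size=(D_AUDIO, D_LATENT))     # audio teacher, not trained here


def forward(video, audio):
    """Return (predicted audio latents, teacher audio latents) for one segment."""
    z_video = video @ W_student   # context: visual embeddings from the student
    pred = z_video @ W_predictor  # predictor maps them toward the audio latent space
    target = audio @ W_teacher    # target: teacher latents, treated as constants
    return pred, target


# One aligned segment with T frames per modality (random placeholder features).
video = rng.normal(size=(T, D_VIDEO))
audio = rng.normal(size=(T, D_AUDIO))

pred, target = forward(video, audio)
print(pred.shape, target.shape)
```

Only the video path carries trainable parameters in this sketch; the audio path is needed to produce training targets and is dropped at inference time.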
The objective
For aligned audio-visual segments, the latent-prediction term is the squared $\ell_2$ distance between the predictor's output on the visual embeddings and the audio teacher's latents:
$$\mathcal{L}_{\text{JEP}} = \big\lVert\, g_\phi\big(f_\theta(v)\big) - \operatorname{sg}\big[\bar f^{\text{audio}}(a)\big]\,\big\rVert_2^2,$$
where $v$ is the lip video, $a$ the aligned audio, $\bar f^{\text{audio}}$ the audio teacher encoder (held frozen or updated by EMA), and $\operatorname{sg}$ the stop-gradient operator, which keeps gradients from flowing into the teacher. This latent objective is combined with the task and standard distillation losses, so the video encoder learns to predict the audio model's semantic features in addition to producing correct outputs.
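The loss and its combination with the other terms can be sketched as follows. This is a hedged illustration: averaging the squared distance over frames, the plain weighted sum, and the weights `lambda_kd` and `lambda_jep` are assumptions, since the text does not specify how the terms are balanced. The stop-gradient is modeled by treating `target` as a constant and returning a gradient for `pred` only.

```python
import numpy as np


def jep_loss_and_grad(pred, target):
    """Squared L2 latent-prediction loss, averaged over frames.

    `target` stands for sg[teacher(a)]: it is treated as a constant,
    so a gradient is returned with respect to `pred` only.
    Averaging over frames is an assumption of this sketch.
    """
    diff = pred - target
    loss = float(np.mean(np.sum(diff ** 2, axis=-1)))
    grad_pred = 2.0 * diff / pred.shape[0]  # d(loss)/d(pred)
    return loss, grad_pred


def total_loss(task, kd, jep, lambda_kd=1.0, lambda_jep=1.0):
    """Weighted sum of task, logit-distillation, and latent-prediction terms.

    The weights are hypothetical hyperparameters, not values from the source.
    """
    return task + lambda_kd * kd + lambda_jep * jep


# Toy example: 2 frames, 3 latent dimensions.
pred = np.ones((2, 3))
target = np.zeros((2, 3))

loss, grad = jep_loss_and_grad(pred, target)

# A small gradient step on the prediction moves it toward the fixed target.
stepped_loss, _ = jep_loss_and_grad(pred - 0.1 * grad, target)
print(loss, stepped_loss)

print(total_loss(task=1.0, kd=2.0, jep=loss, lambda_kd=0.5, lambda_jep=2.0))
```

In an automatic-differentiation framework the same effect is usually obtained by detaching the teacher output from the graph rather than by writing the gradient by hand.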
Key results & what's novel
JEP-KD repurposes the JEPA principle from self-supervised pretraining to cross-modal supervised distillation. By treating audio-to-video transfer as latent prediction, the visual encoder absorbs phonetically discriminative structure that is under-determined in pixels — information that pure logit matching does not convey. This narrows the audio-visual performance gap and improves visual speech recognition. The broader novelty is methodological: it demonstrates that latent prediction is a general mechanism for transferring knowledge across modalities, not only an objective for unsupervised representation learning.
Strengths & limitations
- + Transfers structured intermediate features, not just logits, narrowing the audio-visual gap.
- + Reuses a pretrained audio teacher to supervise a video-only student.
- + Shows latent prediction generalizes to cross-modal supervised distillation.
- − Requires aligned audio-visual training data and a strong audio teacher.
- − The viseme ambiguity is reduced, not eliminated; some phonemes remain visually indistinguishable.
- − Although inference is video-only, results depend on the teacher's quality and on alignment accuracy during training.