AsymK-Talker: Real-Time and Long-Horizon Talking Head Generation via Asymmetric Kernel Distillation

arXiv cs.LG / 5/6/2026

📰 News · Models & Research

Key Points

  • AsymK-Talker is a new diffusion-distillation approach aimed at improving audio-driven talking-head generation for real-time use and long-horizon stability.
  • The method introduces Kernel-Conditioned Loop Generation (KCLG), a causal chunk-wise generation strategy that uses motion kernels to maintain temporal consistency during inference.
  • It adds Temporal Reference Encoding (TRE) to transform a static identity reference into a time-aware latent representation, strengthening audio-visual synchronization.
  • It uses Asymmetric Kernel Distillation (AKD), where the teacher is supervised with ground-truth motion kernels while the student learns from its own generated kernels to reduce progressive drift over long sequences.
  • The paper reports promising improvements in visual fidelity and lip synchronization metrics compared with prior audio-driven talking-head methods.

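The chunk-wise loop described in the KCLG bullet can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the chunk length, kernel dimension, and the `extract_motion_kernel` / `generate_chunk` functions are all hypothetical stand-ins; the key idea shown is only the causal feed-forward of a motion kernel from each generated chunk into the next.

```python
import numpy as np

CHUNK_LEN = 16   # frames per chunk (hypothetical)
KERNEL_DIM = 64  # motion-kernel size (hypothetical)

def extract_motion_kernel(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for the paper's motion-kernel extractor: here we just
    pool frame-to-frame differences into a fixed-size vector."""
    diffs = np.diff(chunk, axis=0)          # coarse motion signal
    return diffs.mean(axis=0)[:KERNEL_DIM]  # pooled "kernel"

def generate_chunk(audio_feat, kernel, rng):
    """Stand-in for the distilled generator: produce one chunk of
    latent frames conditioned on audio and the incoming kernel."""
    base = rng.standard_normal((CHUNK_LEN, KERNEL_DIM))
    return base + 0.1 * kernel + 0.1 * audio_feat  # toy conditioning

def kclg_loop(audio_chunks, init_kernel, rng):
    """Causal chunk-wise loop: each chunk conditions on the kernel
    extracted from the previous chunk, so motion propagates forward
    in time and never requires future frames."""
    kernel = init_kernel
    video = []
    for audio_feat in audio_chunks:
        chunk = generate_chunk(audio_feat, kernel, rng)
        kernel = extract_motion_kernel(chunk)  # feed forward only
        video.append(chunk)
    return np.concatenate(video, axis=0)
```

Because each step depends only on the previous chunk's kernel, generation can run indefinitely with constant memory, which is what makes the strategy suitable for real-time, long-horizon use.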
Abstract

Recent advances in diffusion models have markedly enhanced the visual fidelity of audio-driven talking head generation. Nevertheless, existing methods are constrained by three critical limitations: non-causal inference that impedes real-time generation, incompatibility with temporally coherent conditioning, and progressive drift over long-horizon generation, which collectively hinder deployment in real-time applications. To overcome these challenges, we introduce AsymK-Talker, a novel diffusion-distillation method designed for real-time and long-horizon talking head generation. AsymK-Talker comprises three key components: (1) Kernel-Conditioned Loop Generation (KCLG), a causal, chunk-wise generation paradigm that leverages motion kernels to enable temporally consistent propagation; (2) Temporal Reference Encoding (TRE), which converts a static identity reference into a time-aware latent representation to enhance audio-visual synchronization; and (3) Asymmetric Kernel Distillation (AKD), a teacher-student distillation framework wherein the teacher model conditions on ground-truth motion kernels for supervision, while the student conditions on its own generated kernels, thereby ensuring robustness over extended generation sequences. AsymK-Talker achieves promising results on both visual fidelity and lip-synchronization metrics.