Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

arXiv cs.AI / 4/14/2026


Key Points

  • The paper addresses limitations of existing audio-driven human video generation by moving from monologue-only output to full-duplex interactive avatars that both speak and respond to incoming conversational audio.
  • It proposes a multi-head Gaussian kernel that injects a temporal inductive bias to handle the scale discrepancy between talking and listening, avoiding rigid long-range response behavior while preserving lip synchronization.
  • The authors build a virtual agent that simultaneously processes dual-stream audio for talking and listening, enabling more natural conversational turn-taking in generated digital humans.
  • They introduce a cleaned VoxHear dataset with perfectly decoupled speech and background audio tracks to improve training/evaluation for interactive talking–listening settings.
  • The authors report experiments showing the approach sets a new state of the art for generating highly natural, responsive full-duplex interactive digital humans.

Abstract

Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal Scale Discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned Talking-Listening dataset VoxHear featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state-of-the-art for generating highly natural and responsive full-duplex interactive digital humans. The project page is available at https://warmcongee.github.io/beyond-monologue/ .
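The abstract's core mechanism, a multi-head Gaussian kernel injecting a temporal inductive bias into attention, can be sketched in minimal form. The paper does not publish the exact formulation, so the function name, the per-head bandwidths, and the multiplicative-in-log-space combination below are all illustrative assumptions: the idea is only that heads with small bandwidths enforce tight frame-to-frame alignment (lip sync), while heads with large bandwidths admit long-range conversational context (listening reactions).

```python
import numpy as np

def gaussian_temporal_bias(num_frames, sigmas):
    """Build one (T, T) Gaussian bias map per attention head.

    Each head's bias decays with the squared temporal distance |i - j|
    between frames. A small sigma yields a near-diagonal map (strict
    local audio-video alignment); a large sigma yields a broad map
    (long-range context). Names and values here are hypothetical,
    not the paper's actual parameterization.
    """
    idx = np.arange(num_frames)
    dist = np.abs(idx[:, None] - idx[None, :])          # (T, T) frame distances
    return np.stack([np.exp(-dist**2 / (2.0 * s**2)) for s in sigmas])  # (H, T, T)

# Progressive per-head scales: from strict lip-sync to broad listening context.
H, T = 4, 8
sigmas = [1.0, 2.0, 4.0, 8.0]                           # assumed, not from the paper
bias = gaussian_temporal_bias(T, sigmas)                # (H, T, T)

# Fold the bias into attention as a log-space additive prior, then softmax.
rng = np.random.default_rng(0)
logits = rng.standard_normal((H, T, T))
weights = np.exp(logits + np.log(bias + 1e-9))
weights /= weights.sum(axis=-1, keepdims=True)          # rows sum to 1 per head
```

With this prior, the head with sigma = 1.0 concentrates nearly all attention mass within a frame or two of the diagonal, while the sigma = 8.0 head attends almost uniformly across the clip, which is one plausible way to reconcile lip synchronization with long-range response dynamics in a single attention stack.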