InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance
arXiv cs.CV / March 25, 2026
Key Points
- The paper proposes InterDyad, a speech-to-video generation framework tailored for dyadic (two-person) interactive settings, where existing methods struggle to model cross-individual dependencies and to provide fine-grained control over reactive behavior.
- InterDyad uses an Interactivity Injector to reenact the behavior observed in reference videos via identity-agnostic motion priors extracted from them, enabling more natural interaction dynamics (first sketch below).
- A MetaQuery-based modality alignment component leverages a Multimodal Large Language Model (MLLM) to distill linguistic intent from the conversational audio, so that generated reactions are both precisely timed and contextually appropriate (second sketch below).
- To maintain lip-sync under extreme head poses, the method introduces Role-aware Dyadic Gaussian Guidance (RoDG), which improves audio-visual synchronization and spatial consistency (third sketch below).
- The authors report significant performance gains over state-of-the-art approaches and introduce a dedicated evaluation suite with new metrics for measuring dyadic interaction quality; demo videos are available on the project page.
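To make the first component concrete, here is a minimal sketch of an Interactivity Injector-style module, assuming it conditions the generator's frame tokens on identity-agnostic motion priors through cross-attention. All names, shapes, and the residual cross-attention design are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: injecting reference-video motion priors into generation.
import torch
import torch.nn as nn

class InteractivityInjector(nn.Module):
    def __init__(self, hidden_dim: int = 512, motion_dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Project motion priors (e.g., keypoint/flow features with identity
        # cues stripped out) into the generator's hidden space.
        self.motion_proj = nn.Linear(motion_dim, hidden_dim)
        # Generated frame tokens attend to the reference motion sequence.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, frame_tokens: torch.Tensor, motion_priors: torch.Tensor) -> torch.Tensor:
        # frame_tokens:  (B, T_gen, hidden_dim) -- tokens of frames being generated
        # motion_priors: (B, T_ref, motion_dim) -- per-frame motion features
        #                from the reference video, identity-agnostic by assumption
        motion = self.motion_proj(motion_priors)
        attended, _ = self.cross_attn(query=frame_tokens, key=motion, value=motion)
        # Residual injection keeps the base generation path intact.
        return self.norm(frame_tokens + attended)

# Usage: inject reference dynamics into 16 generated frame tokens.
injector = InteractivityInjector()
frames = torch.randn(2, 16, 512)
priors = torch.randn(2, 64, 256)
out = injector(frames, priors)  # (2, 16, 512)
```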
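The second sketch illustrates MetaQuery-style modality alignment, assuming a small set of learnable query tokens reads reaction-relevant intent out of frozen MLLM hidden states computed over the conversational audio. Dimensions and the query-pooling design are assumptions for illustration.

```python
# Hypothetical sketch: learnable queries as a bottleneck between an MLLM
# and the video generator's conditioning space.
import torch
import torch.nn as nn

class MetaQueryAligner(nn.Module):
    def __init__(self, num_queries: int = 32, mllm_dim: int = 4096,
                 out_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Learnable queries distill "who reacts, and when" from MLLM features.
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(mllm_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(mllm_dim, out_dim)

    def forward(self, mllm_feats: torch.Tensor) -> torch.Tensor:
        # mllm_feats: (B, L, mllm_dim) -- hidden states the MLLM produced for
        # the dyadic audio (speaker turns, moments where a reaction fits).
        B = mllm_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        pooled, _ = self.attn(query=q, key=mllm_feats, value=mllm_feats)
        # (B, num_queries, out_dim) conditioning tokens for the generator.
        return self.proj(pooled)

# Usage: pool 120 steps of (assumed 4096-dim) MLLM features into 32 tokens.
aligner = MetaQueryAligner()
audio_feats = torch.randn(2, 120, 4096)
cond = aligner(audio_feats)  # (2, 32, 512)
```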
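Finally, a minimal sketch of Gaussian spatial guidance in the spirit of RoDG, assuming it renders one Gaussian heatmap per role centered on each person's mouth so lip-sync supervision stays spatially anchored as the head rotates. The standard 2D Gaussian formulation and the landmark-derived mouth centers are assumptions.

```python
# Hypothetical sketch: per-role Gaussian guidance maps for dyadic lip-sync.
import torch

def dyadic_gaussian_maps(mouth_centers: torch.Tensor, h: int, w: int,
                         sigma: float = 8.0) -> torch.Tensor:
    # mouth_centers: (B, 2, 2) -- (x, y) mouth center in pixels for each of the
    # two roles, e.g., re-estimated from facial landmarks under the current pose.
    B = mouth_centers.size(0)
    ys = torch.arange(h, dtype=torch.float32).view(1, 1, h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, 1, 1, w)
    cx = mouth_centers[..., 0].view(B, 2, 1, 1)
    cy = mouth_centers[..., 1].view(B, 2, 1, 1)
    # One Gaussian channel per role keeps the two speakers disentangled.
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))  # (B, 2, h, w)

# Usage: two samples, mouths at fixed pixel coordinates in a 64x64 frame.
centers = torch.tensor([[[20.0, 40.0], [44.0, 42.0]]]).repeat(2, 1, 1)
maps = dyadic_gaussian_maps(centers, h=64, w=64)  # (2, 2, 64, 64)
```

Keeping the two roles in separate channels is what makes the guidance "role-aware" in this sketch: the generator can attend to each speaker's mouth region independently, rather than receiving a single merged saliency map.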