InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

arXiv cs.CV / 3/25/2026


Key Points

  • The paper proposes InterDyad, a speech-to-video generation framework tailored for dyadic (two-person) interactive settings where existing methods struggle with cross-individual dependencies and fine-grained reactive control.
  • InterDyad uses an Interactivity Injector that performs video reenactment from identity-agnostic motion priors extracted from reference videos, enabling more natural interaction dynamics.
  • A MetaQuery-based modality alignment component leverages a Multimodal Large Language Model (MLLM) to distill linguistic intent from conversational audio, dictating the precise timing and appropriateness of reactions (a minimal sketch of this two-stream design follows this list).
  • To handle lip-sync under extreme head poses, the method introduces Role-aware Dyadic Gaussian Guidance (RoDG) to improve synchronization and spatial consistency.
  • The authors report significant performance gains over state-of-the-art approaches and add a dedicated evaluation suite with new metrics to measure dyadic interaction quality, supported by demo videos on the project page.
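To make the two-stream design above concrete, the block below is a minimal PyTorch sketch of how per-speaker audio features could be distilled into motion-prior tokens by learnable queries and then fused across speakers, so each person's motion can depend on the other's. All class names, layer choices, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MetaQueryAligner(nn.Module):
    """Learnable query tokens cross-attend to audio features to produce
    motion-prior tokens (a stand-in for the MLLM-based alignment)."""
    def __init__(self, d_model=256, n_queries=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, audio_feats):  # audio_feats: (B, T_audio, d_model)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        out, _ = self.attn(q, audio_feats, audio_feats)
        return out  # (B, n_queries, d_model) motion-prior tokens

class InteractivityInjector(nn.Module):
    """Fuses the two speakers' motion-prior tokens so each person's motion
    reflects the other's behavior (cross-individual dependency)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, own_tokens, partner_tokens):
        fused, _ = self.cross_attn(own_tokens, partner_tokens, partner_tokens)
        return self.proj(own_tokens + fused)

# Toy forward pass: batch of 2, 100 audio frames per speaker, 256-dim features.
aligner, injector = MetaQueryAligner(), InteractivityInjector()
audio_a, audio_b = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
tok_a, tok_b = aligner(audio_a), aligner(audio_b)
motion_a = injector(tok_a, tok_b)  # guidance for person A, aware of B
motion_b = injector(tok_b, tok_a)  # guidance for person B, aware of A
print(motion_a.shape, motion_b.shape)  # torch.Size([2, 16, 256]) each
```

In the actual framework the query tokens are produced with the help of an MLLM and the fused tokens would condition a video generator; the sketch only shows the cross-speaker fusion pattern, not the full pipeline.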

Abstract

Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization and spatial consistency. Finally, we introduce a dedicated evaluation suite with newly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: https://interdyad.github.io/.
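The abstract does not detail how RoDG is constructed. One plausible reading, shown below as a NumPy sketch, is a role-conditioned Gaussian heatmap centered on the mouth region, sharper for the active speaker and broader for the listener; the function names, grid size, sigma values, and the speaker/listener rule are all assumptions made for illustration.

```python
import numpy as np

def gaussian_heatmap(center, size=(64, 64), sigma=3.0):
    """2D Gaussian centered at `center` = (x, y) on a size[0] x size[1] grid."""
    ys, xs = np.mgrid[0:size[0], 0:size[1]]
    return np.exp(-((xs - center[0])**2 + (ys - center[1])**2) / (2 * sigma**2))

def role_aware_guidance(mouth_xy, role, size=(64, 64)):
    """Sharper guidance around the active speaker's mouth, broader guidance
    for the listener -- a simple stand-in for role conditioning."""
    sigma = 2.0 if role == "speaker" else 5.0
    return gaussian_heatmap(mouth_xy, size=size, sigma=sigma)

# Example: mouth landmark at (32, 40) in a 64x64 face crop.
speaker_map = role_aware_guidance((32, 40), role="speaker")
listener_map = role_aware_guidance((32, 40), role="listener")
print(speaker_map.shape, round(speaker_map.max(), 3), round(listener_map.max(), 3))
```

Such a heatmap could be fed to a video generator as an extra spatial conditioning channel to keep lip motion anchored under large head rotations; whether InterDyad implements RoDG this way is not stated in the abstract.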