Generate Your Talking Avatar from Video Reference

arXiv cs.CV / 5/1/2026


Key Points

  • The paper introduces TAVR (Talking Avatar generation from Video Reference), a framework that generates talking avatars using cross-scene video inputs instead of relying on a single static reference image in the target scene.
  • TAVR uses a token selection module and a three-stage training approach: same-scene video pretraining for appearance copying, cross-scene fine-tuning for domain adaptation, and reinforcement learning that optimizes identity similarity via identity-based rewards (a rough reward sketch follows this list).
  • To measure cross-scene robustness, the authors created a new benchmark with 158 curated cross-scene video pairs.
  • Experiments indicate that TAVR improves talking-avatar quality and identity preservation, supports flexible inference-time video referencing, and outperforms prior baselines both quantitatively and qualitatively.
  • The authors state the method has been deployed to production, and they reference HeyGen research materials for related work.
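
The paper summary above does not spell out how the identity-based reward is computed. A common choice for such rewards is the cosine similarity between face-recognition embeddings of generated frames and the reference; the sketch below illustrates that idea only. The frozen face encoder, frame sampling, and averaging are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def identity_reward(generated_frames: torch.Tensor,
                    reference_frame: torch.Tensor,
                    face_encoder: torch.nn.Module) -> torch.Tensor:
    """Illustrative identity reward: mean cosine similarity between face
    embeddings of generated frames and a reference frame.

    generated_frames: (T, C, H, W) frames sampled from a generated clip
    reference_frame:  (C, H, W) a frame from the reference video
    face_encoder:     any frozen face-recognition network that maps a batch
                      of images to embeddings (assumed, e.g. ArcFace-style)
    """
    with torch.no_grad():
        ref_emb = face_encoder(reference_frame.unsqueeze(0))   # (1, D)
        gen_emb = face_encoder(generated_frames)               # (T, D)
    # Higher similarity -> better identity preservation -> larger reward.
    sims = F.cosine_similarity(gen_emb, ref_emb.expand_as(gen_emb), dim=-1)
    return sims.mean()
```

In a reinforcement-learning stage like the one the paper describes, a scalar reward of this kind would be used to fine-tune the generator toward outputs whose faces match the reference identity.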

Abstract

Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model).
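
The abstract mentions a token selection module for handling the longer temporal context of a video reference but does not describe it. One generic way such a module can keep compute bounded is to score reference tokens against the tokens being generated and keep only the top-k. The sketch below is a minimal illustration under that assumption; the similarity-based scoring is not taken from the paper.

```python
import torch

def select_reference_tokens(ref_tokens: torch.Tensor,
                            query_tokens: torch.Tensor,
                            k: int) -> torch.Tensor:
    """Illustrative token selection: keep the k reference tokens most
    relevant to the clip currently being generated.

    ref_tokens:   (N, D) tokens extracted from the reference video
    query_tokens: (M, D) tokens of the clip being generated
    Returns:      (k, D) selected reference tokens (k <= N)
    """
    # Relevance of each reference token = its best match to any query token.
    scores = ref_tokens @ query_tokens.T          # (N, M) similarity matrix
    relevance = scores.max(dim=1).values          # (N,)
    topk = relevance.topk(k).indices              # (k,)
    return ref_tokens[topk]
```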