Generate Your Talking Avatar from Video Reference

arXiv cs.CV / 5/1/2026


Key Points

  • The paper introduces TAVR (Talking Avatar generation from Video Reference), a framework that generates talking avatars using cross-scene video inputs instead of relying on a single static reference image in the target scene.
  • TAVR uses a token selection module and a three-stage training approach: same-scene video pretraining for appearance copying, cross-scene fine-tuning for domain adaptation, and reinforcement learning that optimizes identity similarity via identity-based rewards (a rough reward sketch follows this list).
  • To measure cross-scene robustness, the authors created a new benchmark with 158 curated cross-scene video pairs.
  • Experiments indicate that TAVR improves talking-avatar quality and identity preservation, supports flexible inference-time video referencing, and outperforms prior baselines both quantitatively and qualitatively.
  • The authors state the method has been deployed to production, and they reference HeyGen research materials for related work.
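
The paper summary above does not spell out how the identity-based reward is computed. A common choice for such rewards is the cosine similarity between face-recognition embeddings of generated frames and the reference; the sketch below illustrates that idea only. The frozen face encoder, frame sampling, and averaging are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def identity_reward(generated_frames: torch.Tensor,
                    reference_frame: torch.Tensor,
                    face_encoder: torch.nn.Module) -> torch.Tensor:
    """Illustrative identity reward: mean cosine similarity between face
    embeddings of generated frames and a reference frame.

    generated_frames: (T, C, H, W) frames sampled from a generated clip
    reference_frame:  (C, H, W) a frame from the reference video
    face_encoder:     any frozen face-recognition network that maps a batch
                      of images to embeddings (assumed, e.g. ArcFace-style)
    """
    with torch.no_grad():
        ref_emb = face_encoder(reference_frame.unsqueeze(0))   # (1, D)
        gen_emb = face_encoder(generated_frames)               # (T, D)
    # Higher similarity -> better identity preservation -> larger reward.
    sims = F.cosine_similarity(gen_emb, ref_emb.expand_as(gen_emb), dim=-1)
    return sims.mean()
```

In a reinforcement-learning stage like the one the paper describes, a scalar reward of this kind would be used to fine-tune the generator toward outputs whose faces match the reference identity.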

Abstract

Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally, task-specific reinforcement learning aligns the generated outputs with identity-based rewards to maximize identity similarity. To systematically evaluate cross-scene robustness, we construct a new benchmark comprising 158 carefully curated cross-scene video pairs. Extensive experiments show that TAVR benefits from flexible inference-time video referencing and consistently surpasses existing baselines both quantitatively and qualitatively. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model).
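
The abstract mentions a token selection module for handling the longer temporal context of a video reference but does not describe it. One generic way such a module can keep compute bounded is to score reference tokens against the tokens being generated and keep only the top-k. The sketch below is a minimal illustration under that assumption; the similarity-based scoring is not taken from the paper.

```python
import torch

def select_reference_tokens(ref_tokens: torch.Tensor,
                            query_tokens: torch.Tensor,
                            k: int) -> torch.Tensor:
    """Illustrative token selection: keep the k reference tokens most
    relevant to the clip currently being generated.

    ref_tokens:   (N, D) tokens extracted from the reference video
    query_tokens: (M, D) tokens of the clip being generated
    Returns:      (k, D) selected reference tokens (k <= N)
    """
    # Relevance of each reference token = its best match to any query token.
    scores = ref_tokens @ query_tokens.T          # (N, M) similarity matrix
    relevance = scores.max(dim=1).values          # (N,)
    topk = relevance.topk(k).indices              # (k,)
    return ref_tokens[topk]
```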