Generate Your Talking Avatar from Video Reference
arXiv cs.CV / 5/1/2026
Key Points
- The paper introduces TAVR (Talking Avatar generation from Video Reference), a framework that generates talking avatars using cross-scene video inputs instead of relying on a single static reference image in the target scene.
- TAVR uses a token selection module and a three-stage training approach: same-scene video pretraining for appearance copying, cross-scene fine-tuning for domain adaptation, and reinforcement learning to optimize identity similarity via identity-based rewards.
- To measure cross-scene robustness, the authors created a new benchmark with 158 curated cross-scene video pairs.
- Experiments indicate that TAVR improves talking-avatar quality and identity preservation, supports flexible video referencing at inference time, and outperforms prior baselines both quantitatively and qualitatively.
- The authors report that the method has been deployed in production, and they point to HeyGen's research materials for related work.
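The reinforcement-learning stage described above optimizes identity similarity via an identity-based reward. A minimal sketch of such a reward is the cosine similarity between identity embeddings of a generated frame and the reference subject; the random stand-in embeddings and the `identity_reward` helper below are illustrative assumptions, not the paper's implementation (a real system would use a pretrained face-recognition encoder):

```python
import numpy as np

def identity_reward(gen_embedding: np.ndarray, ref_embedding: np.ndarray) -> float:
    """Cosine similarity between the identity embedding of a generated
    avatar frame and that of the reference subject, used as an RL reward."""
    a = gen_embedding / np.linalg.norm(gen_embedding)
    b = ref_embedding / np.linalg.norm(ref_embedding)
    return float(a @ b)

# Toy demonstration with random stand-in embeddings; in practice the
# embeddings would come from a face-recognition encoder.
rng = np.random.default_rng(0)
ref = rng.normal(size=512)
gen_same = ref + 0.1 * rng.normal(size=512)   # same identity, small perturbation
gen_other = rng.normal(size=512)              # unrelated identity

print(identity_reward(gen_same, ref) > identity_reward(gen_other, ref))  # True
```

Maximizing this reward during fine-tuning pushes the generator toward frames whose identity embedding stays close to the reference, which is the stated goal of the paper's RL stage.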