Hierarchical Codec Diffusion for Video-to-Speech Generation
arXiv cs.CV / 4/20/2026
Key Points
- The paper addresses Video-to-Speech generation by arguing that prior methods miss the hierarchical structure of speech, from coarse speaker-aware meaning to fine-grained prosody.
- It introduces HiCoDiT, a Hierarchical Codec Diffusion Transformer that leverages the multi-level hierarchy of discrete speech tokens from residual vector quantization (RVQ) to improve audio-visual alignment.
- HiCoDiT uses separate low-level and high-level diffusion blocks: low-level tokens are conditioned on lip-synchronized motion and facial identity, while high-level tokens use facial expressions to shape prosodic behavior.
- To better transfer information from coarse to fine levels, the authors propose a dual-scale adaptive instance normalization that combines channel-wise (global vocal style) and temporal-wise (local prosody dynamics) normalization.
- Experiments reportedly show improved fidelity and expressiveness versus baselines, and the project provides code and a speech demo via the linked GitHub repository.
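The dual-scale adaptive instance normalization mentioned above can be sketched as follows. This is a minimal numpy illustration of the general idea, not the paper's implementation: `adain_channel` normalizes each channel over time and modulates it with per-channel style parameters (global vocal style), while `adain_temporal` normalizes each time step over channels and modulates it with per-frame parameters (local prosody dynamics). The function names and the averaging fusion rule are assumptions; the paper's exact fusion of the two branches may differ.

```python
import numpy as np

def adain_channel(x, gamma, beta, eps=1e-5):
    # x: (C, T) feature map. Normalize each channel across time,
    # then modulate with per-channel style params (global vocal style).
    mu = x.mean(axis=1, keepdims=True)      # (C, 1)
    sigma = x.std(axis=1, keepdims=True)    # (C, 1)
    return gamma[:, None] * (x - mu) / (sigma + eps) + beta[:, None]

def adain_temporal(x, gamma_t, beta_t, eps=1e-5):
    # x: (C, T) feature map. Normalize each time step across channels,
    # then modulate with per-frame style params (local prosody dynamics).
    mu = x.mean(axis=0, keepdims=True)      # (1, T)
    sigma = x.std(axis=0, keepdims=True)    # (1, T)
    return gamma_t[None, :] * (x - mu) / (sigma + eps) + beta_t[None, :]

def dual_scale_adain(x, channel_style, temporal_style):
    # Combine both scales. Averaging the two branches is an assumption
    # made for this sketch; the paper may fuse them differently.
    g_c, b_c = channel_style
    g_t, b_t = temporal_style
    return 0.5 * (adain_channel(x, g_c, b_c) + adain_temporal(x, g_t, b_t))
```

In a hierarchical setup like the one described, the style parameters would be predicted from the coarse (high-level) token features, so that coarse speaker and prosody information conditions the fine-level diffusion blocks.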
