Realistic Lip Motion Generation Based on 3D Dynamic Viseme and Coarticulation Modeling for Human-Robot Interaction
arXiv cs.RO / 4/3/2026
Key Points
- The paper proposes a speech-driven lip motion generation framework for humanoid robots using 3D dynamic viseme modeling and coarticulation to achieve realistic lip synchronization for non-verbal interaction.
- It builds a coherent 3D dynamic viseme library aligned with the ARKit standard by leveraging Chinese pronunciation theory, aiming to provide reliable prior trajectories for continuous speech.
- To address motion conflicts in continuous speech, it introduces a coarticulation mechanism combining initial–final (Shengmu–Yunmu) decoupling with energy modulation.
- The method includes a retargeting strategy that maps high-dimensional spatial lip motion onto a 14-DOF lip actuation system on a humanoid head platform, and validates performance via ablation studies using PCC and MAJ metrics.
- The authors release the 3D dynamic viseme library and deployment videos on GitHub, positioning the approach as lightweight and practical for real-world human-robot interaction scenarios.
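The coarticulation and retargeting steps above can be sketched in simplified form. The snippet below is a minimal illustration, not the paper's implementation: it assumes the initial (Shengmu) and final (Yunmu) viseme trajectories are ARKit-style blendshape sequences, blends them with a hypothetical energy-modulated weight, and linearly retargets the result to 14 actuator commands. All function names, the blend rule, and the linear mapping are assumptions for illustration.

```python
import numpy as np

def blend_visemes(shengmu, yunmu, energy, alpha=0.5):
    """Hypothetical coarticulation blend: mix the initial (Shengmu) and
    final (Yunmu) viseme trajectories, with per-frame speech energy
    modulating the consonant's influence.

    shengmu, yunmu: (T, D) blendshape coefficient trajectories
    energy:         (T,) speech energy in [0, 1]
    """
    w = alpha * energy          # energy-modulated weight, shape (T,)
    return w[:, None] * shengmu + (1.0 - w[:, None]) * yunmu

def retarget(blendshapes, mapping):
    """Sketch of linear retargeting from ARKit-style blendshape
    coefficients (T, D) to robot actuator commands (T, 14),
    clipped to the actuators' normalized range."""
    return np.clip(blendshapes @ mapping, 0.0, 1.0)

# Toy usage with random trajectories (D = 52 ARKit blendshapes).
rng = np.random.default_rng(0)
T, D = 5, 52
shengmu = rng.random((T, D))
yunmu = rng.random((T, D))
energy = np.linspace(0.0, 1.0, T)

blended = blend_visemes(shengmu, yunmu, energy)   # (5, 52)
mapping = rng.random((D, 14)) / D                 # assumed linear map
commands = retarget(blended, mapping)             # (5, 14)
```

A real system would replace the random linear map with a calibrated retargeting matrix for the specific 14-DOF lip mechanism, and derive `energy` from the driving speech signal.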




