Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
arXiv cs.CV / 4/10/2026
Key Points
- The paper introduces Cross-Modal Emotion Transfer (C-MET) to enable more flexible emotion editing in talking-face video generation by mapping emotion semantics between speech and facial visual feature spaces.
- C-MET addresses limitations of prior approaches: label-based methods are restricted to discrete emotion categories, audio-only methods entangle emotion with linguistic content, and image-reference methods require specific views and reference data for extended emotions.
- The approach learns emotion semantic vectors that capture differences between emotional embeddings and transfers them across modalities, using a large pretrained audio encoder and a disentangled facial expression encoder (see the sketch after this list).
- Experiments on MEAD and CREMA-D show a 14% improvement in emotion accuracy over state-of-the-art methods and demonstrate expressive results for unseen extended emotions such as sarcasm.
- The authors provide code, checkpoints, and a demo to support reproducibility and downstream experimentation.
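
To make the mechanism concrete, below is a minimal PyTorch sketch of the cross-modal transfer idea as summarized above: an emotion semantic vector is computed as an offset between audio embeddings, mapped into the facial expression space, and added to a neutral face embedding. The encoder classes, embedding sizes, and mapping MLP (`AudioEncoder`, `FaceExprEncoder`, `CrossModalMapper`) are hypothetical placeholders, not the paper's actual architecture, and the emotional-minus-neutral differencing is an assumption about how the semantic vectors are formed.

```python
# Illustrative sketch only; all modules and dimensions are assumed, not C-MET's real design.
import torch
import torch.nn as nn

EMB_DIM = 256  # assumed shared embedding size

class AudioEncoder(nn.Module):
    """Placeholder for a large pretrained speech encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, EMB_DIM)  # e.g. mel-spectrogram frames -> embedding
    def forward(self, mel):                 # mel: (batch, frames, 80)
        return self.proj(mel).mean(dim=1)   # pooled utterance embedding: (batch, EMB_DIM)

class FaceExprEncoder(nn.Module):
    """Placeholder for a disentangled facial-expression encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(512, EMB_DIM)  # e.g. expression features -> embedding
    def forward(self, expr):                 # expr: (batch, 512)
        return self.proj(expr)

class CrossModalMapper(nn.Module):
    """Maps an audio-space emotion semantic vector into the facial expression space."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(EMB_DIM, EMB_DIM), nn.ReLU(), nn.Linear(EMB_DIM, EMB_DIM)
        )
    def forward(self, delta_audio):
        return self.mlp(delta_audio)

def emotion_semantic_vector(encoder, emotional_input, neutral_input):
    """Emotion semantics as the difference between emotional and neutral embeddings (assumed)."""
    return encoder(emotional_input) - encoder(neutral_input)

# Toy usage: transfer a speech-derived emotion offset onto a neutral face embedding.
audio_enc, face_enc, mapper = AudioEncoder(), FaceExprEncoder(), CrossModalMapper()
mel_emotional = torch.randn(2, 100, 80)   # emotional speech (mel frames)
mel_neutral   = torch.randn(2, 100, 80)   # neutral speech, same speaker/content
expr_neutral  = torch.randn(2, 512)       # neutral facial-expression features

delta_a = emotion_semantic_vector(audio_enc, mel_emotional, mel_neutral)  # audio-space offset
delta_v = mapper(delta_a)                                                 # mapped to face space
edited_face_emb = face_enc(expr_neutral) + delta_v                        # emotion-edited embedding
print(edited_face_emb.shape)  # torch.Size([2, 256])
```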
