Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces Cross-Modal Emotion Transfer (C-MET) to enable more flexible emotion editing in talking-face video generation by mapping emotion semantics between speech and facial visual feature spaces.
  • C-MET addresses limitations of prior label-based methods (discrete emotion categories), audio-only methods (entanglement of emotion and linguistic content), and image-reference methods (requirements for specific views and reference data for extended emotions).
  • The approach learns emotion semantic vectors representing differences between emotional embeddings across modalities using a large pretrained audio encoder and a disentangled facial expression encoder.
  • Experiments on MEAD and CREMA-D show a 14% improvement in emotion accuracy over state-of-the-art methods and demonstrate expressive results for unseen extended emotions such as sarcasm.
  • The authors provide code, checkpoints, and a demo to support reproducibility and downstream experimentation.

Abstract

Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/
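The core idea of an emotion semantic vector - the difference between a neutral and an emotional embedding, applied across modalities - can be sketched in a few lines. This is an illustrative toy only: the paper uses large pretrained encoders, whereas here random unit vectors stand in for encoder outputs, and all names (`emotion_vec`, `face_emotional`, the dimensionality) are assumptions, not details from C-MET.

```python
# Toy sketch of the emotion-semantic-vector idea (NOT the C-MET implementation):
# random unit vectors stand in for the outputs of the pretrained audio encoder
# and the disentangled facial expression encoder.
import numpy as np

rng = np.random.default_rng(0)
DIM = 256  # assumed embedding dimensionality


def normalize(v: np.ndarray) -> np.ndarray:
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)


# Stand-ins for audio-encoder embeddings of neutral vs. emotional speech.
audio_neutral = normalize(rng.normal(size=DIM))
audio_emotional = normalize(rng.normal(size=DIM))

# Emotion semantic vector: the difference between the two embeddings,
# intended to isolate the emotion while factoring out linguistic content.
emotion_vec = audio_emotional - audio_neutral

# Cross-modal transfer: shift a neutral facial-expression embedding by the
# speech-derived emotion vector to obtain an emotional expression code.
face_neutral = normalize(rng.normal(size=DIM))
face_emotional = face_neutral + emotion_vec
```

In this view, "unseen extended emotions" (e.g., sarcasm) become reachable because any expressive speech clip can supply an emotion vector, with no need for a labeled category or a frontal reference image of that emotion.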
