The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

arXiv cs.CL / April 30, 2026


Key Points

  • The paper investigates widely used objective metrics for emotional expressiveness in speech generation, especially those based on cosine similarity of emotion embeddings between reference and generated audio.
  • It argues that embeddings from models such as emotion2vec can be confounded by linguistic content and speaker identity, causing “emotion similarity” scores to reflect non-emotional factors.
  • Through controlled adversarial evaluations and human alignment tests, the authors find that these latent spaces may achieve high classification accuracy but still fail for zero-shot similarity-based evaluation.
  • The study concludes that the metric is overly sensitive to acoustic mimicry rather than genuine emotional synthesis, producing a mismatch with how humans perceive emotion in speech.

Abstract

Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This vulnerability reveals that the metric rewards acoustic mimicry over genuine emotional synthesis.
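To make the critiqued metric concrete, here is a minimal sketch of how emotion-similarity scoring is typically computed. The embedding extraction step is a placeholder (emotion2vec's actual API is not reproduced here, and the vectors below are hypothetical); only the cosine-similarity computation itself is concrete. The paper's point is that a high score from this computation can be driven by shared linguistic content or speaker identity rather than shared emotion.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-size utterance embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical emotion embeddings; in practice these would come from an
# encoder such as emotion2vec applied to reference and generated audio.
ref_emb = np.array([0.20, 0.90, 0.10, 0.35])
gen_emb = np.array([0.25, 0.85, 0.15, 0.30])

emotion_similarity = cosine_similarity(ref_emb, gen_emb)
# A score near 1.0 is read as "same emotion" -- the assumption the
# paper challenges, since content/speaker overlap can also inflate it.
```

Note that the score is bounded in [-1, 1] and is invariant to embedding magnitude, so any confound that aligns the embedding *directions* (e.g. identical text content) inflates it regardless of affect.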