The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

arXiv cs.CL / April 30, 2026


Key Points

  • The paper investigates widely used objective metrics for emotional expressiveness in speech generation, especially those based on cosine similarity of emotion embeddings between reference and generated audio.
  • It argues that embeddings from models such as emotion2vec can be confounded by linguistic content and speaker identity, causing “emotion similarity” scores to reflect non-emotional factors.
  • Through controlled adversarial evaluations and human alignment tests, the authors find that these latent spaces may achieve high classification accuracy but still fail for zero-shot similarity-based evaluation.
  • The study concludes that the metric is overly sensitive to acoustic mimicry rather than genuine emotional synthesis, producing a mismatch with how humans perceive emotion in speech.

Abstract

Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This vulnerability reveals that the metric rewards acoustic mimicry over genuine emotional synthesis.
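To make the critiqued metric concrete, here is a minimal sketch of how emotion-similarity scoring is typically computed. The embedding extraction step is a placeholder (emotion2vec's actual API is not reproduced here, and the vectors below are hypothetical); only the cosine-similarity computation itself is concrete. The paper's point is that a high score from this computation can be driven by shared linguistic content or speaker identity rather than shared emotion.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-size utterance embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical emotion embeddings; in practice these would come from an
# encoder such as emotion2vec applied to reference and generated audio.
ref_emb = np.array([0.20, 0.90, 0.10, 0.35])
gen_emb = np.array([0.25, 0.85, 0.15, 0.30])

emotion_similarity = cosine_similarity(ref_emb, gen_emb)
# A score near 1.0 is read as "same emotion" -- the assumption the
# paper challenges, since content/speaker overlap can also inflate it.
```

Note that the score is bounded in [-1, 1] and is invariant to embedding magnitude, so any confound that aligns the embedding *directions* (e.g. identical text content) inflates it regardless of affect.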