Neural networks for Text-to-Speech evaluation

arXiv cs.AI / 4/13/2026


Key Points

  • The paper tackles the high cost and assessor bias of human TTS evaluation (MOS/SBS) by training neural models to approximate expert judgments for both relative and absolute metrics.
  • For relative evaluation, it proposes NeuralSBS, a HuBERT-backed approach that reaches 73.7% accuracy on the SOMOS dataset.
  • For absolute evaluation, it improves MOSNet with sequence-length batching and introduces WhisperBert, a multimodal stacking ensemble combining Whisper audio features with BERT text embeddings.
  • The best MOS models achieve about 0.40 RMSE, outperforming a human inter-rater RMSE baseline of 0.62, with ablations showing cross-attention fusion can hurt performance.
  • The authors report negative results for SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), arguing for dedicated metric-learning frameworks for reliable TTS scoring.
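The RMSE comparison above can be made concrete with a short sketch. The values below are illustrative only (not taken from the paper): a model is judged against reference MOS labels, and a score around 0.40 would beat the reported 0.62 human inter-rater baseline.

```python
import numpy as np

def rmse(predictions, targets):
    """Root Mean Square Error between predicted and reference MOS scores."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))

# Hypothetical example: predicted vs. reference MOS for four utterances.
model_preds = [3.6, 4.1, 2.9, 3.8]
reference_mos = [4.0, 3.9, 3.2, 3.5]
print(f"model RMSE: {rmse(model_preds, reference_mos):.3f}")
```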

Abstract

Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy on the SOMOS dataset. For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that naively fusing text via cross-attention can degrade performance, highlighting the effectiveness of ensemble-based stacking over direct latent fusion. We additionally report negative results with SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), reinforcing the necessity of dedicated metric-learning frameworks.
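The stacking idea behind WhisperBert can be sketched in miniature. Everything here is an assumption for illustration: random arrays stand in for pooled Whisper audio features and BERT text embeddings, least-squares linear models play the role of the weak learners, and MOS labels are synthetic. The point is the structure: each modality first predicts MOS on its own, and a meta-learner combines those predictions, rather than fusing raw latents directly.

```python
import numpy as np

def fit_ls(X, y):
    """Fit a least-squares linear 'weak learner'; returns a weight vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
# Stand-ins for per-utterance pooled Whisper audio features and BERT text
# embeddings (hypothetical dimensions, not from the paper).
audio = rng.normal(size=(200, 32))
text = rng.normal(size=(200, 16))
mos = rng.uniform(1.0, 5.0, size=200)  # synthetic MOS labels

# Level 0: one weak learner per modality, each predicting MOS independently.
w_audio = fit_ls(audio, mos)
w_text = fit_ls(text, mos)
level0 = np.column_stack([audio @ w_audio, text @ w_text])

# Level 1: a meta-learner stacks the modality-specific predictions --
# the alternative to direct latent fusion that the ablations favor.
w_meta = fit_ls(level0, mos)
pred = level0 @ w_meta
print(f"stacked RMSE: {np.sqrt(np.mean((pred - mos) ** 2)):.3f}")
```

In the real system the level-0 learners would be trained on held-out folds to avoid leaking training predictions into the meta-learner; that cross-validation step is omitted here for brevity.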