Neural networks for Text-to-Speech evaluation
arXiv cs.AI / 4/13/2026
Key Points
- The paper tackles the high cost and assessor bias of human TTS evaluation, i.e., Mean Opinion Score (MOS) and side-by-side (SBS) tests, by training neural models to approximate expert judgments for both relative and absolute metrics.
- For relative evaluation, it proposes NeuralSBS, a HuBERT-backed approach that reaches 73.7% accuracy on the SOMOS dataset.
- For absolute evaluation, it improves MOSNet with sequence-length batching and introduces WhisperBert, a multimodal stacking ensemble combining Whisper audio features with BERT text embeddings.
- The best MOS models achieve about 0.40 RMSE, outperforming a human inter-rater RMSE baseline of 0.62, with ablations showing cross-attention fusion can hurt performance.
- The authors report negative results for SpeechLM-based architectures and for zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 Flash Preview), arguing that reliable TTS scoring requires dedicated metric-learning frameworks.
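The sequence-length batching mentioned for the improved MOSNet can be illustrated with a small sketch. The idea is standard length bucketing: sort variable-length utterances by length before forming batches, so that each batch pads to a similar maximum and wastes little compute. This is a generic illustration under my own assumptions, not the paper's implementation; the function name `length_bucketed_batches` and the toy data are hypothetical.

```python
import random

def length_bucketed_batches(samples, batch_size, seed=0):
    """Group variable-length samples into batches of similar length.

    Sorting by length before slicing into batches keeps padding within
    each batch small; shuffling the batch *order* (not the contents)
    preserves randomness across training steps.
    """
    rng = random.Random(seed)
    ordered = sorted(samples, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    rng.shuffle(batches)  # randomize which batch comes first each epoch
    return batches

# Toy demo: utterances of mixed lengths end up grouped with peers of
# similar length, so per-batch padding is small.
utterances = [[0] * n for n in [3, 50, 4, 48, 5, 52, 6, 47]]
for batch in length_bucketed_batches(utterances, batch_size=4):
    lengths = [len(u) for u in batch]
    padding = sum(max(lengths) - l for l in lengths)
    print(lengths, "padding needed:", padding)
```

With random batching, a 3-frame clip can land next to a 52-frame one and the whole batch pads to 52; with bucketing, short clips batch together and long clips batch together, which is why it speeds up MOS-model training without changing the objective.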