Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

arXiv cs.CL / 4/24/2026


Key Points

  • The study proposes a controlled, multidimensional pairwise evaluation framework to reduce high variance when crowdsourcing preference judgments for multilingual TTS.
  • Using 5K+ native and code-mixed sentences across 10 Indic languages, the authors benchmark 7 state-of-the-art TTS systems with 120K+ pairwise comparisons from 1,900+ native raters.
  • Raters score models not only on overall preference but also across six perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations.
  • The paper builds a multilingual leaderboard via Bradley–Terry modeling and uses SHAP analysis plus reliability checks to connect human preferences to specific model strengths and trade-offs.
  • The work highlights how linguistic diversity and multi-attribute perception can be jointly handled to produce more interpretable and dependable TTS evaluation results.

Abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1,900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preferences using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
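
To make the Bradley-Terry step concrete, here is a minimal sketch of how a leaderboard can be derived from pairwise outcomes. Under the Bradley-Terry model, system i is preferred over system j with probability p_i / (p_i + p_j), where each p is a positive strength score. The sketch fits those strengths with the standard minorization-maximization (MM) update. The system names "A", "B", "C" and all win counts are invented for illustration and are not results from the paper. The function name `fit_bradley_terry` is hypothetical, ties are not modeled, and the paper may well use a different fitting procedure.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=1000, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the MM update p_i <- W_i / sum_j [n_ij / (p_i + p_j)],
    where W_i is the number of wins of system i and n_ij is the
    number of comparisons between i and j. Strengths are rescaled
    to sum to 1 after every pass.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        systems.update((winner, loser))

    systems = sorted(systems)
    strength = {s: 1.0 / len(systems) for s in systems}

    for _ in range(n_iter):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # A system that was never compared keeps its current strength.
            updated[i] = wins[i] / denom if denom > 0 else strength[i]

        total = sum(updated.values())
        updated = {s: v / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strength[s]) for s in systems)
        strength = updated
        if delta < tol:
            break
    return strength


# Toy data (hypothetical): 10 comparisons per pair of systems.
comparisons = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("A", "C")] * 8 + [("C", "A")] * 2
    + [("B", "C")] * 6 + [("C", "B")] * 4
)

strengths = fit_bradley_terry(comparisons)
leaderboard = sorted(strengths, key=strengths.get, reverse=True)

for rank, name in enumerate(leaderboard, start=1):
    print(rank, name, round(strengths[name], 3))
```

In a multilingual setting, the same function could be run once per language, or once per perceptual dimension, to obtain separate rankings. Checking how stable those rankings are under resampling of the comparisons is one simple way to probe leaderboard reliability.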