PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

arXiv cs.CL / 4/29/2026

Key Points

  • The paper introduces PSP (Phoneme Substitution Profile), a new interpretable per-phonological-dimension benchmark for measuring accent quality in Indic text-to-speech beyond standard intelligibility and naturalness metrics.
  • PSP breaks accent into six dimensions: retroflex collapse rate, aspiration fidelity, vowel-length fidelity, Tamil-zha fidelity, Frechet Audio Distance, and prosodic signature divergence. The first four are measured with forced alignment plus native-speaker centroid probes; the last two are corpus-level distributional distances.
  • The v1 release benchmarks five systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS, and Praxy Voice) on Hindi, Telugu, and Tamil pilot sets, and adds an R5->R6 case study on Telugu.
  • Results show that retroflex collapse rises monotonically with phonological difficulty (Hindi ~1%, Telugu ~40%, Tamil ~68%), that PSP rankings can diverge from WER-based rankings, and that no single TTS system is best across all six accent dimensions.
  • The authors release reference centroids, embeddings, prosodic feature matrices, golden sets, and MIT-licensed scoring code to support further reproducible accent-focused evaluation (with MOS-correlation planned for v2).
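
To make the centroid-probe idea in the key points above more concrete, here is a minimal sketch of how a retroflex collapse rate could be computed. It assumes a nearest-centroid rule under cosine similarity: a segment that should be retroflex counts as collapsed when its embedding sits closer to a native dental centroid than to the native retroflex centroid. The decision rule, the function names, and the 2-D toy vectors are all illustrative assumptions; the summary does not spell out the exact rule PSP uses, and real inputs would be XLS-R embeddings of force-aligned segments.

```python
import numpy as np

def cosine_sim(embs, centroid):
    """Cosine similarity between each row of `embs` and one centroid vector."""
    embs = embs / np.linalg.norm(embs, axis=-1, keepdims=True)
    centroid = centroid / np.linalg.norm(centroid)
    return embs @ centroid

def retroflex_collapse_rate(segment_embs, retroflex_centroid, dental_centroid):
    """Fraction of expected-retroflex segments that land nearer the dental centroid.

    Hypothetical helper: the nearest-centroid rule is an assumption, not the
    paper's documented scoring procedure.
    """
    sim_retro = cosine_sim(segment_embs, retroflex_centroid)
    sim_dental = cosine_sim(segment_embs, dental_centroid)
    collapsed = sim_dental > sim_retro
    return float(np.mean(collapsed))

# Toy data standing in for native reference centroids and synthesized segments.
retro_c = np.array([1.0, 0.0])
dental_c = np.array([0.0, 1.0])
segments = np.array([
    [0.9, 0.1],  # clearly retroflex-like
    [0.2, 0.8],  # collapsed toward dental
    [0.7, 0.3],  # retroflex-like
    [0.1, 0.9],  # collapsed toward dental
])

rate = retroflex_collapse_rate(segments, retro_c, dental_c)
print(f"retroflex collapse rate: {rate:.2f}")  # 0.50 for this toy data
```

A per-dimension score of this form stays interpretable: a rate of 0.50 reads directly as "half of the expected retroflex segments sounded dental to the probe."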

Abstract

Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5->R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering -- commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
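
For readers unfamiliar with corpus-level distributional distances such as FAD, the sketch below computes the standard Frechet distance between two Gaussians fitted to sets of embeddings. Whether PSP's released scoring code uses exactly this formulation is an assumption; the function name is hypothetical, and random vectors stand in for the native and synthesized clip embeddings.

```python
import numpy as np

def frechet_distance(ref_embs, test_embs):
    """Frechet distance between Gaussians fitted to two embedding sets.

    d^2 = ||mu_r - mu_t||^2 + tr(C_r) + tr(C_t) - 2 * tr(sqrt(C_r C_t))
    """
    mu_r, mu_t = ref_embs.mean(axis=0), test_embs.mean(axis=0)
    cov_r = np.cov(ref_embs, rowvar=False)
    cov_t = np.cov(test_embs, rowvar=False)
    # For positive semi-definite covariances, tr(sqrt(C_r C_t)) equals the sum
    # of the square roots of the eigenvalues of the product C_r C_t.
    eigvals = np.linalg.eigvals(cov_r @ cov_t)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_t
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_t) - 2.0 * tr_sqrt)

# Toy data: a "native" set and a copy shifted by a constant offset.
rng = np.random.default_rng(0)
native = rng.normal(size=(200, 4))
shifted = native + np.array([3.0, 4.0, 0.0, 0.0])

same = frechet_distance(native, native)     # approximately 0
moved = frechet_distance(native, shifted)   # approximately 25 = 3^2 + 4^2
print(f"identical sets: {same:.4f}, shifted set: {moved:.4f}")
```

Because the shifted copy has the same covariance as the original, the covariance terms cancel and only the squared mean offset remains, which makes the toy case easy to check by hand.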