TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

arXiv cs.CL / 4/27/2026

💬 Opinion · Models & Research

Key Points

  • The paper introduces TTS-PRISM, a perceptual reasoning and interpretable speech model that diagnoses fine-grained acoustic artifacts in Mandarin text-to-speech (TTS) output, which monolithic metrics fail to capture.
  • It defines a 12-dimensional diagnostic schema (from stability to advanced expressiveness) and uses a targeted synthesis pipeline with adversarial perturbations and expert anchors to construct a high-quality diagnostic dataset.
  • The method applies schema-driven instruction tuning so the model’s scoring criteria and reasoning are explicitly embedded into an efficient end-to-end system.
  • Experiments on a 1,600-sample Gold Test Set show that TTS-PRISM aligns with human judgments better than generalist models do, and profiling six TTS paradigms produces intuitive diagnostic flags that expose fine-grained capability differences.
  • The project is released as open source, with code and checkpoints provided via the referenced GitHub repository.
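
The summary does not say how "human alignment" is measured. A common choice for this kind of evaluation is a rank correlation between model scores and human ratings, computed separately for each schema dimension. The sketch below assumes that reading; the function names, the dictionary layout, and the choice of Spearman correlation are illustrative assumptions, not the paper's stated protocol.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(rank(x), rank(y))


def alignment_by_dimension(human, model):
    """Hypothetical helper: both arguments map dimension name -> list of
    per-sample scores, in the same sample order."""
    return {dim: spearman(human[dim], model[dim]) for dim in human}
```

Reporting one correlation per dimension, rather than a single pooled number, keeps the evaluation in line with the paper's argument against monolithic metrics.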

Abstract

While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.
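
The abstract does not define how diagnostic flags are derived. One plausible reading is that a TTS system is flagged on every schema dimension where its average score falls below some cutoff. The sketch below illustrates that reading only; the dimension names, the score scale, and the threshold of 3.0 are placeholders invented for the example, not values from the paper.

```python
def diagnostic_flags(scores, threshold=3.0):
    """Return the dimensions whose mean score is below `threshold`,
    weakest first.

    `scores` maps a dimension name to a list of per-utterance scores.
    The threshold and the implied score scale are assumptions.
    """
    means = {dim: sum(vals) / len(vals) for dim, vals in scores.items() if vals}
    weak = [dim for dim, m in means.items() if m < threshold]
    return sorted(weak, key=means.get)


# Placeholder dimension names; the real schema has 12 dimensions.
system_scores = {
    "dim_a": [4, 5],
    "dim_b": [2, 3],
    "dim_c": [1, 2],
}
flags = diagnostic_flags(system_scores)
print(flags)  # ['dim_c', 'dim_b']
```

Running the same function over each of the six profiled paradigms would give a per-system list of weak dimensions that can be compared side by side.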