TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

arXiv cs.CL / 4/27/2026

Key Points

The paper introduces TTS-PRISM, a perceptual reasoning and interpretable speech model that diagnoses fine-grained acoustic artifacts in Mandarin text-to-speech (TTS) output instead of relying on monolithic metrics.
It defines a 12-dimensional diagnostic schema (spanning stability to advanced expressiveness) and uses a targeted synthesis pipeline with adversarial perturbations and expert anchors to construct a high-quality diagnostic dataset.
The method applies schema-driven instruction tuning so that explicit scoring criteria and reasoning are embedded into an efficient end-to-end model.
Experiments on a 1,600-sample Gold Test Set show that TTS-PRISM aligns better with human judgments than generalist models, and profiling six TTS paradigms yields intuitive diagnostic flags that reveal fine-grained capability differences.
The project is open source, with code and checkpoints available at https://github.com/xiaomi-research/tts-prism.

Abstract

While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.
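The abstract does not say how "human alignment" is measured. One common way to quantify agreement between a model's scores and human ratings is a per-dimension Spearman rank correlation, sketched below. This is an illustrative assumption, not the paper's evaluation code: the dimension names, the function names, and the toy scores are all invented for the example.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def alignment_by_dimension(human, model):
    """Correlate model scores with human ratings separately for each dimension."""
    return {dim: spearman(human[dim], model[dim]) for dim in human}


# Toy data (hypothetical): two of the twelve dimensions, five samples each.
human = {"stability": [1, 2, 3, 4, 5], "expressiveness": [2, 1, 4, 3, 5]}
model = {"stability": [1.2, 2.1, 2.9, 4.4, 4.8], "expressiveness": [5, 4, 3, 2, 1]}

scores = alignment_by_dimension(human, model)
for dim, rho in scores.items():
    print(f"{dim}: rho = {rho:.2f}")
```

Reporting one correlation per dimension, rather than a single pooled number, matches the framework's stated goal of fine-grained diagnosis: a scorer can track human judgments well on one dimension and poorly on another.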

Related Articles

An improvement of the convergence proof of the ADAM-Optimizer

Dev.to

We built an AI that runs an entire business autonomously. Not a demo. Not a prototype. Actually running. YC-backed, here's what we learned.

Reddit r/artificial

langchain-tests==1.1.7

LangChain Releases

Why isn’t LLM reasoning done in vector space instead of natural language?

Reddit r/LocalLLaMA

llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged

Reddit r/LocalLLaMA
