Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS

arXiv cs.CL / 4/6/2026


Key Points

  • The paper addresses a key weakness of zero-shot, prompt-driven TTS systems: existing prompt selection methods may not provide stable speaker-identity cues or well-calibrated emotion-intensity signals.
  • It proposes a two-stage prompt selection strategy for expressive speech synthesis, combining static evaluation (pitch/prosody features, perceptual audio quality, LLM-based text-emotion coherence, and model-based metrics like character error rate and speaker/emotion similarity) with a dynamic selection step during synthesis based on textual similarity.
  • Experiments show the approach improves emotion intensity while maintaining robust speaker identity consistency in zero-shot TTS outputs.
  • The authors plan to release audio samples and code, enabling follow-on evaluation and practical reuse of the prompting strategy for expressive, identity-consistent TTS workflows.
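The two-stage strategy above can be sketched in code. The static stage scores each prompt candidate on several signals and keeps the best ones before synthesis. The paper does not publish aggregation weights or field names, so everything below (the `PromptScores` fields, the weight vector, the top-k cutoff) is an illustrative assumption, not the authors' implementation:

```python
# Sketch of the static stage (before synthesis): each prompt candidate gets a
# weighted combination of the quality signals described in the paper, and the
# top-k candidates survive to the dynamic stage. Weights are hypothetical.
from dataclasses import dataclass

@dataclass
class PromptScores:
    prosody: float            # pitch-based prosodic expressiveness, in [0, 1]
    audio_quality: float      # perceptual audio quality, normalized to [0, 1]
    emotion_coherence: float  # LLM-rated text-emotion coherence, in [0, 1]
    cer: float                # character error rate of synthesized speech (lower is better)
    speaker_sim: float        # speaker similarity, synthesized vs. prompt speech
    emotion_sim: float        # emotion similarity, synthesized vs. prompt speech

def static_score(s: PromptScores,
                 w=(0.2, 0.15, 0.2, 0.15, 0.15, 0.15)) -> float:
    """Weighted aggregate; CER enters as (1 - cer) so higher is always better."""
    parts = (s.prosody, s.audio_quality, s.emotion_coherence,
             1.0 - s.cer, s.speaker_sim, s.emotion_sim)
    return sum(wi * p for wi, p in zip(w, parts))

def select_static(candidates: dict[str, PromptScores], k: int = 3) -> list[str]:
    """Keep the k best-scoring prompt candidates for the dynamic stage."""
    return sorted(candidates,
                  key=lambda c: static_score(candidates[c]),
                  reverse=True)[:k]
```

A uniform-ish weighting is used here only for concreteness; in practice the relative importance of the signals would be tuned per TTS model.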

Abstract

Recent advancements in speech synthesis have enabled large language model (LLM)-based systems to perform zero-shot generation with controllable content, timbre, speaker identity, and emotion through input prompts. As a result, these models heavily rely on prompt design to guide the generation process. However, existing prompt selection methods often fail to ensure that prompts contain sufficiently stable speaker identity cues and appropriate emotional intensity indicators, which are crucial for expressive speech synthesis. To address this challenge, we propose a two-stage prompt selection strategy specifically designed for expressive speech synthesis. In the static stage (before synthesis), we first evaluate prompt candidates using pitch-based prosodic features, perceptual audio quality, and text-emotion coherence scores evaluated by an LLM. We further assess the candidates under a specific TTS model by measuring character error rate, speaker similarity, and emotional similarity between the synthesized and prompt speech. In the dynamic stage (during synthesis), we use a textual similarity model to select the prompt that is most aligned with the current input text. Experimental results demonstrate that our strategy effectively selects prompts for synthesizing speech with both high-intensity emotional expression and robust speaker identity, leading to more expressive and stable zero-shot TTS performance. Audio samples and code will be available at https://whyrrrrun.github.io/ExpPro.github.io/.
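The dynamic stage described in the abstract can be sketched as a nearest-transcript lookup: among the statically pre-filtered prompts, pick the one whose transcript best matches the current input text. The paper uses a dedicated textual similarity model; in the sketch below, a simple bag-of-words cosine similarity stands in for that model purely for illustration, and the function names are hypothetical:

```python
# Sketch of the dynamic stage (during synthesis): select, from the statically
# pre-filtered prompt pool, the prompt whose transcript is most similar to the
# input text. Bag-of-words cosine similarity is a stand-in for the paper's
# textual similarity model.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_dynamic(input_text: str, prompt_transcripts: dict[str, str]) -> str:
    """Return the id of the prompt whose transcript best matches input_text."""
    query = Counter(input_text.lower().split())
    return max(prompt_transcripts,
               key=lambda pid: cosine(query,
                                      Counter(prompt_transcripts[pid].lower().split())))
```

In a real system, the word-count vectors would be replaced by embeddings from a sentence-similarity model, but the selection logic (argmax over similarity to the input text) is the same.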