Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
arXiv cs.CL / 4/6/2026
Key Points
- The paper addresses a key weakness of zero-shot, prompt-driven TTS systems: existing prompt selection methods may not provide stable speaker-identity cues or well-calibrated emotion-intensity signals.
- It proposes a two-stage prompt selection strategy for expressive speech synthesis. A static evaluation stage assesses prompts on pitch/prosody features, perceptual audio quality, LLM-based text-emotion coherence, and model-based metrics such as character error rate and speaker/emotion similarity; a dynamic stage then selects a prompt during synthesis based on textual similarity.
- Experiments show that the approach improves emotion intensity while keeping speaker identity consistent in zero-shot TTS outputs.
- The authors plan to release audio samples and code, enabling follow-on evaluation and practical reuse of the prompting strategy for expressive, identity-consistent TTS workflows.
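
The two-stage idea in the key points can be illustrated with a minimal sketch. This is not the authors' code: the `Prompt` fields, the threshold values, and the bag-of-words cosine similarity are all assumptions chosen for illustration, and the static stage here covers only the model-based metrics (character error rate, speaker similarity, emotion similarity), leaving out the prosody, audio-quality, and LLM-based checks the paper also mentions.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Prompt:
    """A candidate reference prompt; every field name here is hypothetical."""
    transcript: str
    cer: float          # character error rate, lower is better
    speaker_sim: float  # speaker similarity in [0, 1], higher is better
    emotion_sim: float  # emotion similarity in [0, 1], higher is better


def static_filter(prompts, max_cer=0.05, min_speaker_sim=0.7, min_emotion_sim=0.6):
    """Stage 1: keep prompts that pass fixed (assumed) quality thresholds."""
    return [
        p for p in prompts
        if p.cer <= max_cer
        and p.speaker_sim >= min_speaker_sim
        and p.emotion_sim >= min_emotion_sim
    ]


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_prompt(target_text, pool):
    """Stage 2: pick the pooled prompt whose transcript is closest to the target text."""
    if not pool:
        raise ValueError("no prompts survived static filtering")
    target = bag_of_words(target_text)
    return max(pool, key=lambda p: cosine(target, bag_of_words(p.transcript)))


# Made-up candidates, purely for demonstration.
candidates = [
    Prompt("I can't believe we won the game", 0.01, 0.85, 0.80),
    Prompt("The weather is lovely today", 0.02, 0.90, 0.75),
    Prompt("We won the game last night", 0.20, 0.88, 0.90),  # fails the CER threshold
]
pool = static_filter(candidates)
best = select_prompt("We won the final game", pool)
print(best.transcript)
```

Splitting the work this way means the expensive quality checks run once per prompt, offline, while the per-utterance step at synthesis time is only a cheap text comparison. A real system would likely replace the bag-of-words similarity with sentence embeddings.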