MOSS-TTSD: Text to Spoken Dialogue Generation
arXiv cs.CL · March 23, 2026
📰 News · Models & Research
Key Points
- The paper introduces MOSS-TTSD, a spoken dialogue synthesis model for expressive, multi-party conversational speech across languages, addressing long-context modeling and cross-turn coherence.
- It enables long-form single-pass synthesis of up to 60 minutes, supports up to five speakers, and includes zero-shot voice cloning from a short reference clip.
- It also proposes TTSD-eval, an objective evaluation framework based on forced alignment to measure speaker attribution and similarity without relying on diarization tools.
- The model is shown to surpass strong open-source and proprietary baselines in both objective and subjective evaluations of dialogue synthesis across languages.
- With applications in podcast production, dynamic commentary, and entertainment, MOSS-TTSD marks a notable step forward in long-form, multi-speaker voice generation.
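To make the TTSD-eval idea above concrete, here is a minimal, illustrative sketch of a forced-alignment-based speaker-attribution metric. The function name, data shapes, and the midpoint-matching rule are assumptions for illustration only, not the paper's actual implementation: it takes word-level timestamps (as a forced aligner would produce when aligning the script against the synthesized audio), each tagged with its scripted speaker, plus predicted speaker labels per time interval, and scores how often the synthesized speech attributes each word to the right speaker.

```python
# Illustrative sketch of a forced-alignment-based speaker-attribution
# score, in the spirit of TTSD-eval. All names and data shapes here are
# assumptions for illustration, not the paper's implementation.

def attribution_accuracy(aligned_words, predicted_segments):
    """aligned_words: list of (start, end, speaker) tuples, e.g. from
    force-aligning the dialogue script against the synthesized audio.
    predicted_segments: list of (start, end, speaker) intervals, e.g.
    from a speaker-identification model run on the same audio.
    Returns the fraction of words whose midpoint falls inside a segment
    labeled with the scripted speaker (no diarization tool needed)."""
    if not aligned_words:
        return 0.0
    correct = 0
    for start, end, speaker in aligned_words:
        mid = (start + end) / 2
        for seg_start, seg_end, predicted in predicted_segments:
            if seg_start <= mid < seg_end:
                correct += int(predicted == speaker)
                break
    return correct / len(aligned_words)

# Toy two-speaker example: all three words land in correctly
# labeled segments, so the score is 1.0.
words = [(0.0, 0.4, "A"), (0.5, 0.9, "A"), (1.0, 1.6, "B")]
segments = [(0.0, 1.0, "A"), (1.0, 2.0, "B")]
print(attribution_accuracy(words, segments))  # 1.0
```

The appeal of this style of metric, as the key points suggest, is that it sidesteps diarization entirely: the script already says who should speak when, so forced alignment plus a speaker check is enough to score cross-turn speaker consistency.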
Related Articles
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA

OpenSeeker's open-source approach aims to break up the data monopoly for AI search agents
THE DECODER

How to Choose the Best AI Chat Models of 2026 for Your Business Needs
Dev.to

I built an AI that generates lesson plans in your exact teaching voice (open source)
Dev.to

6-Band Prompt Decomposition: The Complete Technical Guide
Dev.to