SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue
arXiv cs.CL / 3/18/2026
Key Points
- The paper introduces SpokenTOD, a large-scale spoken task-oriented dialogue dataset with 52,390 dialogues and 1,034 hours of speech, addressing limitations in data scale and domain coverage.
- SpokenTOD includes four spoken user behaviors—cross-turn slots, barge-in, disfluency, and emotional prosody—captured across diverse speakers and domains.
- Building on SpokenTOD, SpokenUS is a spoken user simulator for task-oriented dialogue, featuring an architecture specifically designed to handle barge-in.
- SpokenUS achieves goal coverage comparable to that of much larger models while outperforming baselines in human mean opinion score (MOS) ratings, and it reveals slot values gradually over the dialogue rather than front-loading them.
- Analyses show SpokenUS’s spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.
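
To make the "gradual versus front-loaded" distinction concrete, here is a minimal sketch that tracks what fraction of a user goal's slot values have been mentioned by each user turn. It is an illustration only and not the paper's evaluation method: the function name `cumulative_slot_coverage`, the substring matching, and the toy goal and dialogues are all assumptions made for this example.

```python
from typing import Dict, List


def cumulative_slot_coverage(goal: Dict[str, str], user_turns: List[str]) -> List[float]:
    """Return, for each user turn, the fraction of goal slots revealed so far.

    Hypothetical helper: a slot counts as revealed once its value appears
    as a case-insensitive substring of any user turn up to that point.
    """
    if not goal:
        return [0.0 for _ in user_turns]
    revealed = set()
    coverage = []
    for turn in user_turns:
        text = turn.lower()
        for slot, value in goal.items():
            if value.lower() in text:
                revealed.add(slot)
        coverage.append(len(revealed) / len(goal))
    return coverage


# Toy goal and dialogues, invented for illustration.
goal = {"area": "centre", "food": "italian", "people": "4"}

gradual = [
    "I need a restaurant in the centre.",
    "Uh, italian food please.",
    "A table for 4.",
]
front_loaded = [
    "I need an italian restaurant in the centre for 4.",
    "Yes.",
    "Thanks.",
]

gradual_curve = cumulative_slot_coverage(goal, gradual)
front_loaded_curve = cumulative_slot_coverage(goal, front_loaded)
print(gradual_curve)
print(front_loaded_curve)
```

Both toy users end at full goal coverage, but the front-loaded user reaches it in the first turn while the gradual user's curve rises one slot at a time. Comparing such curves is one simple way to see the difference the summary describes; a real analysis would need proper slot-value matching rather than substring checks.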




