Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics
arXiv cs.CL / March 25, 2026
📰 News
Key Points
- The paper evaluates conversational ASR approaches under realistic multi-speaker conditions (overlap, far-field noise, and varying speaker counts), focusing on semantic fidelity and overlap robustness rather than only transcription accuracy.
- It compares LLM-based systems against modular pipeline approaches and finds that LLM-based methods are competitive in two-speaker scenarios but deteriorate as the speaker count and the amount of overlapping speech increase.
- To better measure meaning changes that standard word-error metrics may overlook, the authors introduce tcpSemER, an embedding-based semantic variant of tcpWER.
- The work also decomposes tcpWER into overlapping and non-overlapping error components, giving more granular diagnostics of where models fail (see the illustrative sketch after this list).
- Experiments across three datasets support the conclusion that modular pipelines are generally more robust than LLM-based systems in highly overlapped, multi-speaker settings.
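The last two points describe measurement ideas rather than results, so a rough illustration may help. The Python sketch below is not the paper's tcpSemER implementation: the function names (`toy_embed`, `semantic_distance`, `overlap_aware_scores`), the trigram placeholder embedding, and the segment format are assumptions made for illustration, and the actual metric builds on tcpWER's time-constrained speaker attribution, which is omitted here. The sketch only shows the general shape of the idea: score a semantic (embedding-based) distance per segment, then report the aggregate separately for overlapping and non-overlapping speech.

```python
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder sentence embedding: hashed bag of character trigrams.
    A real setup would use an off-the-shelf sentence encoder; this stub
    just keeps the sketch self-contained and runnable."""
    vec = np.zeros(dim)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        vec[hash(padded[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def semantic_distance(reference: str, hypothesis: str) -> float:
    """Semantic error for one segment: 1 - cosine similarity of embeddings.
    Vectors from toy_embed are unit-normalised, so the dot product is the
    cosine similarity."""
    return 1.0 - float(np.dot(toy_embed(reference), toy_embed(hypothesis)))

def overlap_aware_scores(segments):
    """Average semantic distance, reported separately for segments marked
    as overlapping speech vs. single-speaker speech.
    `segments` is a list of (reference, hypothesis, is_overlap) tuples."""
    buckets = {True: [], False: []}
    for ref, hyp, is_overlap in segments:
        buckets[is_overlap].append(semantic_distance(ref, hyp))
    return {
        "overlap": float(np.mean(buckets[True])) if buckets[True] else None,
        "non_overlap": float(np.mean(buckets[False])) if buckets[False] else None,
    }

# Toy example: one clean segment and one segment from an overlapped region.
segments = [
    ("let's move the meeting to friday", "let's move the meeting to friday", False),
    ("can you send the quarterly report", "can you spend the court early report", True),
]
print(overlap_aware_scores(segments))
```

The point of the decomposition is visible even in this toy case: an error that preserves meaning barely moves the embedding distance, while one that changes meaning moves it a lot, and reporting the two overlap buckets separately shows whether failures concentrate in overlapped regions.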