Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics

arXiv cs.CL / 3/25/2026

📰 News

Key Points

  • The paper evaluates conversational ASR approaches under realistic multi-speaker conditions (overlap, far-field noise, and varying speaker counts), focusing on semantic fidelity and overlap robustness rather than only transcription accuracy.
  • It compares LLM-based systems against modular pipeline approaches and finds LLM-based methods are competitive for two-speaker scenarios but deteriorate as speaker count and speech overlap increase.
  • To better measure meaning changes that standard word-error metrics may overlook, the authors introduce tcpSemER, an embedding-based semantic variant of tcpWER.
  • The work also breaks tcpWER into overlapping vs. non-overlapping error components to provide more granular diagnostics of where models fail.
  • Experiments across three datasets support the conclusion that modular pipelines are generally more robust than LLM-based systems in highly overlapped, multi-speaker settings.
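The core idea behind tcpSemER, scoring how much a hypothesis changes the *meaning* of the reference rather than counting edits, can be illustrated with a minimal sketch. Here a toy bag-of-words vector stands in for a real sentence embedding, and the time-constrained speaker matching of tcpWER is omitted; the paper's actual embedding model and scoring details are not reproduced:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" used purely for illustration; a
    # tcpSemER-style metric would use a neural sentence embedder instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_error(ref: str, hyp: str) -> float:
    # 0.0 for a hypothesis with identical content, growing toward 1.0
    # as the recognized text diverges in meaning from the reference.
    return 1.0 - cosine(embed(ref), embed(hyp))

print(semantic_error("turn the heating on", "turn the heating on"))  # 0.0
print(semantic_error("turn the heating on", "turn the heating off") <
      semantic_error("turn the heating on", "open the window"))      # True
```

The point of the substitution is visible even in this toy version: a hypothesis sharing most content words scores a smaller semantic error than one that replaces the meaning entirely, even when their word-level edit distances are similar.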

Abstract

Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.
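The overlap decomposition described above requires deciding, for each reference word, whether another speaker was talking at the same time. A minimal sketch of that partition step, assuming word-level timestamps and speaker labels are available (the paper's exact criterion and its integration into tcpWER scoring may differ):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float
    speaker: str

def split_by_overlap(words: list[Word]) -> tuple[list[Word], list[Word]]:
    """Partition reference words into overlapped vs. non-overlapped.

    A word counts as overlapped if any word from a *different* speaker
    intersects its time interval. Illustrative only; error rates would
    then be computed separately over each bucket.
    """
    overlapped, clean = [], []
    for w in words:
        hit = any(o.speaker != w.speaker and o.start < w.end and w.start < o.end
                  for o in words)
        (overlapped if hit else clean).append(w)
    return overlapped, clean

# Hypothetical two-speaker snippet: speaker B starts while A is still talking.
ref = [Word("hello", 0.0, 0.5, "A"),
       Word("there", 0.5, 1.0, "A"),
       Word("hi",    0.8, 1.2, "B")]
ov, cl = split_by_overlap(ref)
print([w.text for w in ov], [w.text for w in cl])  # ['there', 'hi'] ['hello']
```

Scoring the two buckets separately is what lets the analysis show *where* LLM-based systems lose ground: errors concentrated in the overlapped bucket point to overlap handling rather than general transcription quality.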
