Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

arXiv cs.CL / 4/13/2026


Key Points

  • The paper argues that standard ASR evaluation via word error rate (WER) can miss sentence-level semantic errors, motivating semantic-aware assessment beyond token accuracy.
  • It introduces an agentic interactive ASR framework that uses an LLM-as-a-judge to evaluate semantic coherence and recognition quality (a minimal sketch follows this list).
  • The authors also design an LLM-driven multi-turn interaction mechanism to simulate human-like correction, iteratively refining ASR outputs using semantic feedback.
  • Experiments on benchmarks such as GigaSpeech (English), WenetSpeech (Chinese), and the ASRU 2019 code-switching test set show improvements in semantic fidelity and interactive correction capability under both objective and subjective measures.
  • The authors plan to release code to support further research in interactive and agentic speech recognition systems.
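
To make the LLM-as-a-judge evaluation concrete, the sketch below shows one way such a scorer could be wired up. It is an illustration, not the authors' implementation: the OpenAI-compatible client, the prompt wording, the 1-5 coherence scale, and the model name are all assumptions.

```python
# Minimal sketch of an LLM-as-a-judge scorer for ASR hypotheses.
# Assumptions (not from the paper): an OpenAI-compatible chat endpoint,
# a 1-5 coherence scale, and JSON-formatted verdicts.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are judging an ASR transcript against a reference.
Reference: {reference}
Hypothesis: {hypothesis}
Rate how well the hypothesis preserves the sentence-level meaning on a
1-5 scale (5 = meaning fully preserved, 1 = meaning lost), ignoring
punctuation and minor word-form differences.
Reply with JSON only: {{"score": <int>, "rationale": "<one sentence>"}}"""


def judge_semantic_coherence(reference: str, hypothesis: str,
                             model: str = "gpt-4o-mini") -> dict:
    """Return the judge model's semantic-coherence verdict as a dict."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(reference=reference,
                                                  hypothesis=hypothesis)}],
    )
    return json.loads(response.choices[0].message.content)


# Two hypotheses with identical WER can differ sharply in meaning:
# "not signed" vs. "now signed" is one substitution but flips the sentence.
print(judge_semantic_coherence(
    reference="The contract was not signed by the supplier.",
    hypothesis="The contract was now signed by the supplier."))
```

The point of the example is the last pair of sentences: a single-word substitution leaves WER almost unchanged while inverting the meaning, which is exactly the gap a semantic judge is meant to capture.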

Abstract

Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. Second, interactive correction, an essential component of human communication, has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), and the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.
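
The multi-turn refinement described in the abstract can likewise be pictured as a short loop. The sketch below reuses `client` and `judge_semantic_coherence` from the earlier example; the corrector prompt, the acceptance threshold, the turn budget, and the choice to let the simulated judge see the reference are illustrative assumptions rather than the paper's protocol.

```python
# Hedged sketch of an LLM-driven correction loop (reuses `client` and
# `judge_semantic_coherence` from the previous example). Prompts, the
# acceptance threshold, and the turn budget are illustrative assumptions.
def interactive_refine(reference: str, hypothesis: str,
                       max_turns: int = 3, accept_score: int = 4,
                       model: str = "gpt-4o-mini") -> str:
    """Iteratively revise an ASR hypothesis using semantic feedback."""
    current = hypothesis
    for _ in range(max_turns):
        verdict = judge_semantic_coherence(reference, current, model=model)
        if verdict["score"] >= accept_score:
            break  # judged semantically acceptable; stop early
        # Feed the judge's rationale back as a clarification turn,
        # mimicking a user pointing out what the transcript got wrong.
        revision = client.chat.completions.create(
            model=model,
            temperature=0.0,
            messages=[{"role": "user", "content": (
                f"An ASR system transcribed an utterance as: '{current}'. "
                f"A listener commented: {verdict['rationale']} "
                "Rewrite the transcript so the meaning is coherent, "
                "changing as few words as possible. Return only the text.")}],
        )
        current = revision.choices[0].message.content.strip()
    return current
```

Exposing the reference to the judge here stands in for a simulated human who knows what was actually said; in a deployed system the feedback would instead come from the user or from downstream context.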