Evaluation of Automatic Speech Recognition Using Generative Large Language Models

arXiv cs.CL / 4/24/2026


Key Points

  • Traditional ASR evaluation using Word Error Rate (WER) can miss meaning, so the paper explores more semantics-aligned metrics using generative LLMs.
  • The study evaluates LLMs for semantic ASR assessment via three methods: best-hypothesis selection, semantic distance with generative embeddings, and qualitative error classification.
  • On the HATS dataset, the strongest LLM approaches reach 92–94% agreement with human annotators for hypothesis selection, far exceeding WER’s 63%.
  • Decoder-based LLM embeddings perform similarly to encoder-based models, suggesting either architecture can be effective for embedding-based semantic evaluation.
  • The results indicate LLM-driven ASR evaluation could enable more interpretable, meaning-aware metrics beyond standard WER.
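The embedding-based semantic distance mentioned in the second method can be pictured as a cosine distance between sentence embeddings of the reference transcript and an ASR hypothesis. The sketch below is a hypothetical illustration with toy vectors, not the paper's exact setup; in practice the vectors would come from a generative (decoder-based) LLM, e.g. a pooled hidden state over the transcript tokens.

```python
import numpy as np

def semantic_distance(ref_emb: np.ndarray, hyp_emb: np.ndarray) -> float:
    """Cosine distance between a reference and a hypothesis embedding.

    Embeddings are assumed to be nonzero vectors; in the paper's setting
    they would be produced by an LLM, here we use toy vectors.
    """
    ref = ref_emb / np.linalg.norm(ref_emb)
    hyp = hyp_emb / np.linalg.norm(hyp_emb)
    return 1.0 - float(ref @ hyp)

# Toy check: a semantically close hypothesis should get a smaller
# distance than a semantically distant one.
ref = np.array([0.9, 0.1, 0.2])
close_hyp = np.array([0.85, 0.15, 0.25])
far_hyp = np.array([0.1, 0.9, 0.3])
assert semantic_distance(ref, close_hyp) < semantic_distance(ref, far_hyp)
```

A lower distance then signals that the hypothesis preserves more of the reference's meaning, regardless of surface word overlap — which is exactly where WER falls short.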

Abstract

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
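The first approach, best-hypothesis selection, amounts to an LLM-as-judge comparison: the model is shown the reference and two candidate transcripts and asked which better preserves the meaning. The prompt template below is a hypothetical sketch of that setup; the paper does not publish this exact wording, and the model call itself is omitted.

```python
def build_selection_prompt(reference: str, hyp_a: str, hyp_b: str) -> str:
    """Build a pairwise judging prompt for an LLM (illustrative wording only)."""
    return (
        "Reference transcript: " + reference + "\n"
        "Hypothesis A: " + hyp_a + "\n"
        "Hypothesis B: " + hyp_b + "\n"
        "Which hypothesis better preserves the meaning of the reference? "
        "Answer with a single letter: A or B."
    )

# Example usage: the returned string would be sent to a generative LLM,
# and the single-letter answer compared against a human annotator's choice.
prompt = build_selection_prompt(
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the bat sat on the map",
)
assert "Hypothesis A" in prompt and "Hypothesis B" in prompt
```

Agreement with human annotators is then simply the fraction of pairs where the model's letter matches the annotator's pick, which is the 92–94% figure reported for the strongest LLMs.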