Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

arXiv cs.CL / 4/23/2026

💬 Opinion / Models & Research

Key Points

  • ASR error rates on child speech are high, which limits applications that depend on it; the paper proposes identifying in advance which utterance-level ASR outputs are reliable.
  • It introduces two utterance-level selection approaches: one for selecting reliable read speech and one for selecting reliable dialogue speech.
  • Experiments on English and Dutch datasets (with both baseline and fine-tuned ASR models) show that the best strategy achieves high precision (P > 97.4) for both speech types and both languages.
  • With the current best strategy, 21.0% to 55.9% of the dialogue and read speech datasets can be selected automatically while keeping the utterance error rate low (UER < 2.6).

Abstract

Automatic Speech Recognition (ASR) is increasingly used in applications involving child speech, such as language learning and literacy acquisition. However, the effectiveness of such applications is limited by high ASR error rates. The negative effects can be mitigated by identifying in advance which ASR-outputs are reliable. This work aims to develop two novel approaches for selecting reliable ASR-output at the utterance level, one for selecting reliable read speech and one for dialogue speech material. Evaluations were done on an English and a Dutch dataset, each with a baseline and a fine-tuned model. The results show that, with the best strategy, utterance-level selection of reliably transcribed speech recordings achieves high precision (P > 97.4) for both read speech and dialogue material in both languages. Using the current optimal strategy allows 21.0% to 55.9% of dialogue/read speech datasets to be automatically selected with low error rates (UER < 2.6).
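
The reported figures (precision, utterance error rate, and the share of a dataset that gets selected) can be illustrated with a minimal Python sketch. It does not reproduce the paper's selection methods, which the abstract does not detail. It only shows how such a selection could be scored once some strategy has flagged utterances as reliable. The assumptions are as follows: an utterance counts as correctly transcribed when the ASR output matches the reference exactly after lowercasing and whitespace tokenization; UER is taken as the percentage of selected utterances containing at least one error, so that it is the complement of precision (consistent with the reported 97.4 and 2.6, but still an assumption); all names (`Utterance`, `evaluate_selection`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    reference: str   # human transcript (ground truth)
    hypothesis: str  # ASR output
    selected: bool   # True if some selection strategy flagged the output as reliable


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase and split on whitespace.
    return text.lower().split()


def evaluate_selection(utterances: list[Utterance]) -> dict[str, Optional[float]]:
    """Score a selection: coverage, precision and UER, all in percent."""
    selected = [u for u in utterances if u.selected]
    if not selected:
        # Nothing selected: precision and UER are undefined.
        return {"coverage": 0.0, "precision": None, "uer": None}

    # A selected utterance is "correct" if the ASR output matches the reference exactly.
    correct = sum(normalize(u.reference) == normalize(u.hypothesis) for u in selected)
    precision = 100.0 * correct / len(selected)
    return {
        "coverage": 100.0 * len(selected) / len(utterances),  # share of the dataset selected
        "precision": precision,                                # selected and error-free
        "uer": 100.0 - precision,                              # selected but containing an error
    }


if __name__ == "__main__":
    data = [
        Utterance("the cat sat", "the cat sat", selected=True),
        Utterance("a big dog", "a big dog", selected=True),
        Utterance("he runs fast", "he run fast", selected=False),
        Utterance("she reads", "she reads", selected=False),
        Utterance("we play", "we pay", selected=False),
    ]
    print(evaluate_selection(data))
```

Under this reading, a strategy trades coverage against precision: a stricter selection keeps fewer utterances but lowers the chance that a selected one contains an error.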