Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

arXiv cs.CL / 4/10/2026


Key Points

  • The paper argues that speech-to-text progress looks stalled on academic benchmarks because they overrepresent common, easy-to-recognize vocabulary, while real-world usability hinges on recognizing rare, context-specific custom terms.
  • It introduces Contextual Earnings-22, a new open dataset derived from Earnings-22 that adds realistic custom vocabulary contexts to better measure real-world transcription performance.
  • The authors provide six strong baseline models for the two leading strategies, keyword prompting and keyword boosting, to support comparable evaluation across research groups (sketches of both ideas follow below).
  • Experiments indicate that both approaches achieve comparable accuracy and show significant gains when moving from small proof-of-concept setups to large-scale systems.
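
Keyword prompting typically means feeding the expected custom terms to the recognizer as conditioning text before decoding. The paper's baseline configurations are not detailed in this summary; as a minimal sketch of the general idea, openai-whisper exposes an `initial_prompt` argument that can carry a keyword list. The model size, audio path, and keyword list below are illustrative placeholders, not the paper's setup.

```python
# Minimal keyword-prompting sketch with openai-whisper (pip install openai-whisper).
# Keywords, model size, and audio path are illustrative placeholders.
import whisper

# Custom vocabulary for one hypothetical earnings call: metrics and product names.
keywords = ["EBITDA", "ARR", "AcmeCloud", "churn cohort"]

model = whisper.load_model("small")

# initial_prompt is prepended to the decoder context, nudging the model
# toward rare, call-specific terms it would otherwise misrecognize.
result = model.transcribe(
    "earnings_call.wav",
    initial_prompt="Glossary: " + ", ".join(keywords),
)
print(result["text"])
```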

Abstract

The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks. In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: Academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems.
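
Keyword boosting, by contrast, rewards hypotheses that actually contain the custom terms at decoding or rescoring time. The exact boosting mechanism behind the paper's baselines is not specified in this summary; the self-contained sketch below shows one common variant, n-best rescoring with a fixed log-score bonus per matched keyword. The hypotheses, scores, and bonus weight are made up for illustration.

```python
# Toy keyword boosting via n-best rescoring: each matched custom-vocabulary
# term adds a fixed bonus to a hypothesis's log score. Scores, hypotheses,
# and the bonus weight are illustrative, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    log_score: float  # combined acoustic + language-model log score

def boost(hyps: list[Hypothesis], keywords: list[str], bonus: float = 2.0) -> Hypothesis:
    """Return the best hypothesis after adding `bonus` per matched keyword."""
    def boosted_score(h: Hypothesis) -> float:
        matches = sum(1 for kw in keywords if kw.lower() in h.text.lower())
        return h.log_score + bonus * matches
    return max(hyps, key=boosted_score)

if __name__ == "__main__":
    nbest = [
        Hypothesis("we expect a bit down margin expansion in q4", -4.1),
        Hypothesis("we expect EBITDA margin expansion in Q4", -4.6),
    ]
    best = boost(nbest, keywords=["EBITDA", "ARR"])
    print(best.text)  # the keyword-bearing hypothesis wins despite a lower base score
```

The same biasing idea can also be applied inside beam search (adding the bonus as partial hypotheses are expanded) rather than after the fact; rescoring is shown here only because it is the simplest form to state.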