Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language

arXiv cs.CL / 3/31/2026


Key Points

  • The paper introduces the Pashto Common Voice corpus, built within Mozilla Common Voice (MCV) and described as the first large-scale, openly licensed speech dataset for Pashto, a low-resource language with over 60 million native speakers that is largely absent from open speech technology.
  • Between 2022 and 2025, across ten Mozilla Common Voice releases (CV14–CV23), the dataset grew from 1.5 hours with 5 contributors to 147 hours with 1,483 unique speakers. Speaker participation rose roughly 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign.
  • The authors detail an end-to-end methodology: Pashto interface localization, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach.
  • MCV23 includes 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 is reported to achieve 13.4% WER on the MCV20 test split, versus the published zero-shot WER of 99.0% for Whisper Base on Pashto.
  • By combining community sourcing with targeted data collection and model fine-tuning results, the work provides a practical pathway to improve Pashto ASR performance using open speech resources.
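
The key points mention Wikipedia-based sentence extraction with automated filtering but do not spell out the rules. As a rough illustration only, the sketch below shows what such a filter could look like. Every rule here (the word-count bounds, the digit rejection, the Arabic-script character ranges) and the function name `keep_sentence` are assumptions made for this example, not the authors' actual criteria.

```python
import re

# Assumed rule: only Arabic-script characters (which cover Pashto letters),
# whitespace, the zero-width non-joiner, and basic Latin punctuation.
ARABIC_SCRIPT_ONLY = re.compile(r"^[\u0600-\u06FF\u0750-\u077F\s\u200c.!?]+$")

# Assumed rule: reject any digit (ASCII, Arabic-Indic, or Extended Arabic-Indic),
# since digits can be read aloud in more than one way.
ANY_DIGIT = re.compile(r"[0-9\u0660-\u0669\u06F0-\u06F9]")


def keep_sentence(sentence, min_words=3, max_words=14):
    """Return True if a candidate sentence passes the hypothetical filter."""
    s = sentence.strip()
    if not s:
        return False
    if ANY_DIGIT.search(s):
        return False
    if not ARABIC_SCRIPT_ONLY.match(s):
        return False
    word_count = len(s.split())
    return min_words <= word_count <= max_words


def filter_sentences(candidates):
    """Keep only the candidates that pass keep_sentence, preserving order."""
    return [c for c in candidates if keep_sentence(c)]
```

A real pipeline would likely add further checks (deduplication, abbreviation handling, licensing of the source text); those are left out here to keep the sketch short.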

Abstract

We present the Pashto Common Voice corpus -- the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 yields 13.4% WER on the MCV20 test split, against the published Whisper Base zero-shot WER of 99.0% on Pashto.
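
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal, dependency-free version of that standard definition. It is not the authors' evaluation script, and it applies no text normalization (punctuation or diacritic handling), which real evaluations usually do before scoring.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. A score near 99%, like the zero-shot figure quoted above, means almost no reference words were recovered correctly.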