Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language

arXiv cs.CL / 3/31/2026

Key Points

The paper introduces the Pashto Common Voice corpus, built within Mozilla Common Voice (MCV) and described as the first large-scale, openly licensed speech dataset for Pashto, a low-resource language with over 60 million native speakers that is largely absent from open speech technology.
Between 2022 and 2025, across ten Mozilla Common Voice releases (CV14–CV23), the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers. Speaker participation rose roughly 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign.
The authors describe an end-to-end methodology: localisation of the Common Voice interface into Pashto, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach.
MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 is reported to reach 13.4% WER on the MCV20 test split, against a published zero-shot WER of 99.0% for Whisper Base on Pashto.
By pairing community sourcing and targeted data collection with fine-tuning results, the work offers a practical pathway for improving Pashto ASR with open speech resources.
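
The WER figures quoted above follow the usual definition of word error rate: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that standard definition in Python. It is not the authors' evaluation code; the function name `word_error_rate` and the plain whitespace tokenisation are assumptions made for illustration, and real evaluations typically normalise text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace, with no
    punctuation or script normalisation applied.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # Placeholder tokens, not real Pashto transcripts.
    print(word_error_rate("a b c d", "a x c"))
```

Because inserted words count as errors, WER is not capped at 100%. A score near 99% means the output contains roughly as many word errors as the reference has words.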

Abstract

We present the Pashto Common Voice corpus -- the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, the corpus grew from 1.5 hours and 5 contributors to 147 total hours and 1,483 unique speakers across ten Mozilla Common Voice releases (CV14-CV23). Speaker participation increased approximately 108-fold between CV17 and CV18, coinciding with a VOA Pashto broadcast campaign. We describe the full methodology: interface localisation, Wikipedia-based sentence extraction with automated filtering, phonemically targeted contributions for the four most frequently dropped Pashto characters, and multi-channel community outreach. MCV23 contains 107,781 clips (60,337 validated; 82.33 validated hours) across 13 content domains. Fine-tuning Whisper Base on MCV20 yields 13.4% WER on the MCV20 test split, against the published Whisper Base zero-shot WER of 99.0% on Pashto.
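
A few derived quantities make the headline numbers in the abstract easier to interpret. The Python sketch below only does arithmetic on figures stated there; the function name `corpus_summary` and the dictionary keys are illustrative choices, not terms from the paper. The relative WER reduction is indicative only, since the abstract does not say that the published zero-shot figure was measured on the MCV20 test split.

```python
# Figures copied from the abstract.
TOTAL_CLIPS = 107_781
VALIDATED_CLIPS = 60_337
VALIDATED_HOURS = 82.33
HOURS_FIRST, HOURS_LAST = 1.5, 147.0      # first release vs. CV23
SPEAKERS_FIRST, SPEAKERS_LAST = 5, 1_483  # first release vs. CV23
WER_ZERO_SHOT, WER_FINE_TUNED = 99.0, 13.4  # percent


def corpus_summary() -> dict[str, float]:
    """Return simple ratios derived from the reported corpus statistics."""
    return {
        # Share of recorded clips that passed community validation.
        "validated_fraction": VALIDATED_CLIPS / TOTAL_CLIPS,
        # Average duration of a validated clip, in seconds.
        "mean_validated_clip_seconds": VALIDATED_HOURS * 3600 / VALIDATED_CLIPS,
        # Growth factors between the first and the latest release.
        "hours_growth": HOURS_LAST / HOURS_FIRST,
        "speaker_growth": SPEAKERS_LAST / SPEAKERS_FIRST,
        # Fraction of the zero-shot error removed by fine-tuning.
        "relative_wer_reduction": (WER_ZERO_SHOT - WER_FINE_TUNED) / WER_ZERO_SHOT,
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        print(f"{name}: {value:.3f}")
```

On these figures, about 56% of MCV23 clips are validated, a validated clip averages roughly 4.9 seconds, total hours grew 98-fold, and the speaker count grew nearly 300-fold.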
