Fine-tuning Whisper for Pashto ASR: strategies and scale

arXiv cs.CL / 4/9/2026

Key Points

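Because Pashto is absent from Whisper's pre-training corpus, off-the-shelf models transcribe Pashto audio in other scripts, producing word error rates (WER) so high that they are unsuitable for practical use.
The paper compares four fine-tuning strategies for whisper-base (full fine-tuning, LoRA, frozen encoder, and Urdu-to-Pashto transfer) and reports that full fine-tuning performs best on CV20, reaching a WER of 21.22%.
Frozen-encoder fine-tuning (2 of the 6 encoder layers frozen) degrades performance because the layer-function separation hypothesis does not hold at this depth and freezing reduces trainable capacity; Urdu-to-Pashto transfer also fails, which the authors attribute to an unverified intermediate checkpoint, phonological mismatch, and insufficient training.
With the data scaled up to 113 hours (CV24), whisper-small is the practical optimum (WER 24.89%); whisper-large-v3-turbo improves further to 23.37%, but with diminishing returns.
Online augmentation yields a 7.25 pp WER improvement over matched training, and error analysis identifies the dominant errors as gender confusion in word-final suffixes and substitutions involving the Pashto-specific consonant /ts/ (fine-tuned checkpoints and evaluation scripts are released on HuggingFace).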
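
A WER can exceed 100% because the error count includes insertions while the denominator is the reference length only. The following is a minimal sketch of a word-level WER calculation, not the paper's released evaluation script; the function name word_error_rate and the placeholder tokens are illustrative assumptions, and no text normalization is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words (Levenshtein over words).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for a transcript emitted in the wrong script:
# no hypothesis word matches, and the hypothesis is longer than the reference.
print(word_error_rate("w1 w2", "x1 x2 x3"))  # 1.5, i.e. 150%
```

With a two-word reference and a three-word hypothesis that shares no words with it, the sketch counts two substitutions and one insertion, giving 3/2 = 150%.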

Abstract

Pashto is absent from Whisper's pre-training corpus despite being one of CommonVoice's largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable capacity. Urdu-to-Pashto transfer fails due to an unverified intermediate checkpoint, phonological mismatch, and insufficient training. On CV24, whisper-small achieves WER 24.89% (2.24 pp over whisper-base at 3.3x parameters); whisper-large-v3-turbo achieves 23.37% (a further 1.52 pp). Diminishing returns indicate whisper-small is the practical optimum at 113 hours. Online augmentation provides 7.25 pp WER benefit over matched training. Error analysis identifies word-final suffix confusion (masculine -ay vs. feminine -a) and retroflex substitutions involving the Pashto-unique consonant /ts/ as dominant failure modes. Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
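
The percentage-point gaps quoted in the abstract imply a WER for each comparison system. The sketch below works through that arithmetic, assuming every gap is an absolute WER difference relative to the stated figure; the implied values are derived here rather than quoted from the paper, and the dictionary keys are illustrative labels.

```python
# WER values (percent) stated directly in the abstract.
CV20_FULL_FT = 21.22   # whisper-base, vanilla full fine-tuning, CV20
CV24_SMALL = 24.89     # whisper-small, CV24
CV24_TURBO = 23.37     # whisper-large-v3-turbo, CV24

# Gaps in percentage points by which full fine-tuning wins on CV20.
cv20_gaps_pp = {
    "lora_rank64": 33.36,
    "frozen_encoder": 14.76,
    "urdu_transfer": 44.56,
}

# Assumption: each gap is an absolute WER difference, so the implied WER of
# a comparison system is the full fine-tuning WER plus the gap.
implied_cv20_wer = {
    name: round(CV20_FULL_FT + gap, 2) for name, gap in cv20_gaps_pp.items()
}

# Assumption: "2.24 pp over whisper-base" means whisper-small is 2.24 pp better.
implied_cv24_base_wer = round(CV24_SMALL + 2.24, 2)

# The gain from whisper-small to whisper-large-v3-turbo, recomputed.
turbo_gain_pp = round(CV24_SMALL - CV24_TURBO, 2)

for name, wer in implied_cv20_wer.items():
    print(f"{name}: implied WER {wer:.2f}%")
print(f"whisper-base on CV24: implied WER {implied_cv24_base_wer:.2f}%")
print(f"small -> large-v3-turbo gain: {turbo_gain_pp:.2f} pp")
```

Under these assumptions, the recomputed gain from whisper-small to whisper-large-v3-turbo matches the 1.52 pp stated in the abstract, smaller than the 2.24 pp step from whisper-base to whisper-small, which is the diminishing-returns pattern the authors describe.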
