FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish

arXiv cs.CL / 4/1/2026


Key Points

  • The paper introduces FLEURS-Kobani, a new spoken extension of the FLEURS benchmark that adds Northern Kurdish (ISO 639-3 KMR) to enable ASR, speech translation (S2TT), and speech-to-speech translation (S2ST) evaluation in this under-resourced language.
  • FLEURS-Kobani contains 5,162 validated utterances (18 hours 24 minutes) recorded by 31 native speakers and is publicly released under a CC BY 4.0 license for research use.
  • The work provides baselines by fine-tuning Whisper v3-large for ASR and end-to-end (E2E) S2TT. A two-stage fine-tuning approach (Common Voice → FLEURS-Kobani) gives the best ASR result, with a WER of 28.11 and a CER of 9.84 on the test set.
  • For KMR→EN speech translation, the E2E Whisper model reaches 8.68 BLEU on the test set. The paper also reports pivot-derived targets and a cascaded S2TT configuration to broaden the evaluation setups.
  • FLEURS-Kobani is positioned as the first public Northern Kurdish benchmark, filling a gap in prior FLEURS coverage and supporting standardized benchmarking for multiple speech tasks.

Abstract

FLEURS offers n-way parallel speech for 100+ languages, but Northern Kurdish is not one of them, which limits benchmarking for automatic speech recognition and speech translation tasks in this language. We present FLEURS-Kobani, a Northern Kurdish (ISO 639-3 KMR) spoken extension of the FLEURS benchmark. The FLEURS-Kobani dataset consists of 5,162 validated utterances, totaling 18 hours and 24 minutes, recorded by 31 native speakers. The dataset extends benchmark coverage to an under-resourced Kurdish variety. As baselines, we fine-tuned Whisper v3-large for ASR and E2E S2TT. A two-stage fine-tuning strategy (Common Voice to FLEURS-Kobani) yields the best ASR performance (WER 28.11, CER 9.84 on the test set). For E2E S2TT (KMR to EN), Whisper achieves 8.68 BLEU on the test set; we additionally report pivot-derived targets and a cascaded S2TT setup. FLEURS-Kobani provides the first public Northern Kurdish benchmark for evaluation of ASR, S2TT and S2ST tasks. The dataset is publicly released for research use under a CC BY 4.0 license.
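
For readers unfamiliar with the ASR metrics quoted above, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed: the Levenshtein edit distance between reference and hypothesis, summed over the corpus and divided by the total reference length. The function names `edit_distance` and `error_rate` are illustrative, not taken from the paper, and the abstract does not describe the text normalization used (casing, punctuation, whether spaces count as characters). The sketch assumes whitespace tokenization for WER and counts every character, including spaces, for CER, so it should not be expected to reproduce the reported scores exactly.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two token sequences."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str], unit: str = "word") -> float:
    """Corpus-level error rate in percent; unit is "word" (WER) or "char" (CER)."""
    if unit not in ("word", "char"):
        raise ValueError("unit must be 'word' or 'char'")
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")

    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        # Assumption: whitespace tokenization for words; all characters
        # (including spaces) count for CER. No other normalization is applied.
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)

    if total == 0:
        raise ValueError("reference text is empty")
    return 100.0 * errors / total


if __name__ == "__main__":
    references = ["a b c d"]
    hypotheses = ["a x c"]
    print(f"WER: {error_rate(references, hypotheses, 'word'):.2f}")
    print(f"CER: {error_rate(references, hypotheses, 'char'):.2f}")
```

Summing errors over the corpus before dividing, rather than averaging per-utterance rates, weights long utterances more heavily. This is the usual convention for reported WER and CER, though the abstract does not say which one the authors use.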