AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
arXiv cs.CL / 4/10/2026
Key Points
- AfriVoices-KE is a new large-scale, multilingual speech dataset with about 3,000 hours of audio covering five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali.
- The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech gathered from 4,777 native speakers across diverse regions and demographics to better reflect real linguistic variation.
- Data collection used both scripted methods (text corpora, translations, and domain-relevant generated sentences across eleven Kenyan-context domains) and unscripted elicitation via textual and image prompts.
- Contributors recorded through a smartphone app; quality assurance combined automated signal-to-noise checks before recording with human review for content accuracy.
- The project targets underrepresentation of African languages in speech technology, aiming to enable more inclusive ASR and TTS systems and support digital preservation of Kenya’s linguistic heritage.
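The summary does not say how the app computes its signal-to-noise check, so the following is only a minimal sketch of what such a gate might look like, assuming a short test clip that contains both pauses and speech. The function names, the frame length, the 10% noise fraction, and the 20 dB threshold are all hypothetical and are not taken from the paper.

```python
import math

def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Rough SNR estimate in dB (illustrative heuristic, not the paper's method).

    Assumes the quietest frames are background noise and the loudest are speech.
    """
    # Mean power of each non-overlapping frame.
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    if len(powers) < 2:
        raise ValueError("need at least two full frames")

    powers.sort()
    k = max(1, int(len(powers) * noise_fraction))
    noise_power = sum(powers[:k]) / k      # quietest frames
    signal_power = sum(powers[-k:]) / k    # loudest frames

    eps = 1e-12  # avoids log(0) and division by zero on digital silence
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))

def passes_snr_gate(samples, min_snr_db=20.0):
    """Return True if the clip clears the (hypothetical) minimum SNR."""
    return estimate_snr_db(samples) >= min_snr_db
```

A clip in which every frame has the same power scores 0 dB and would be rejected, while a clip with quiet pauses and clearly louder speech clears the gate. A production check would more likely use a voice activity detector than this loudest-versus-quietest heuristic.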