AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

arXiv cs.CL / 4/10/2026

Key Points

AfriVoices-KE is a new large-scale, multilingual speech dataset with about 3,000 hours of audio covering five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali.
The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech gathered from 4,777 native speakers across diverse regions and demographics to better reflect real linguistic variation.
Data collection used both scripted methods (text corpora, translations, and domain-relevant generated sentences across eleven Kenyan-context domains) and unscripted elicitation via textual and image prompts.
A smartphone-based mobile app supported contributor recording, while quality assurance used automated signal-to-noise checks before recording and human review for content accuracy.
The project targets underrepresentation of African languages in speech technology, aiming to enable more inclusive ASR and TTS systems and support digital preservation of Kenya’s linguistic heritage.

Abstract

AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.
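The abstract mentions automated signal-to-noise ratio validation before recording but gives no details of how it was implemented. The following is a minimal sketch of what such a gate could look like, not the project's actual method. It assumes the app captures a short ambient-noise sample and a short calibration utterance as lists of float samples; the function names and the 20 dB threshold (`MIN_SNR_DB`) are hypothetical choices for illustration.

```python
import math

# Hypothetical threshold; the source does not state the value the app used.
MIN_SNR_DB = 20.0


def mean_power(samples):
    """Return the mean squared amplitude of a sequence of float samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(calibration, noise):
    """Estimate SNR in decibels as 10 * log10(P_calibration / P_noise).

    This is a simple estimate: the calibration sample also contains
    ambient noise, which is ignored here.
    """
    p_noise = mean_power(noise)
    p_signal = mean_power(calibration)
    if p_noise == 0.0:
        return math.inf
    if p_signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(p_signal / p_noise)


def passes_snr_gate(calibration, noise, min_snr_db=MIN_SNR_DB):
    """Return True if the estimated SNR meets the minimum threshold."""
    return snr_db(calibration, noise) >= min_snr_db


def tone(amplitude, freq_hz, sample_rate=16000, seconds=1.0):
    """Generate a synthetic sine tone, used here only as stand-in audio."""
    n = int(sample_rate * seconds)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


# Synthetic stand-ins for a calibration utterance and two noise conditions.
speech_like = tone(0.5, 440.0)
quiet_room = tone(0.005, 50.0)
noisy_room = tone(0.1, 50.0)

quiet_ok = passes_snr_gate(speech_like, quiet_room)
noisy_ok = passes_snr_gate(speech_like, noisy_room)
```

In this sketch a contributor in the quiet condition would be allowed to start recording, while the noisy condition would be rejected and the contributor asked to move somewhere quieter. A production check would likely work on short frames of live microphone input rather than whole synthetic buffers.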
