The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

arXiv cs.CL / 4/1/2026

Key Points

The Thiomi Dataset introduces a large-scale multimodal corpus covering ten African languages from four language families, pairing sentence-level text annotations with audio recordings.
It includes more than 601,000 approved text annotations and more than 385,000 audio recordings, gathered through a community data collection platform with over 100 contributors; the Swahili data is supplemented with existing Common Voice recordings.
A multi-tier quality assurance pipeline achieves high text approval rates (86–100%) for six core languages, supporting dataset reliability at scale.
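The authors train baseline automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) models across all ten languages and report strong ASR results, including a 3.24% word error rate (WER) on Swahili (Common Voice) and 4.3% WER on Somali.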
The dataset is slated for release on Hugging Face, and the paper describes the collection platform and quality assurance workflows, aiming to strengthen African language technology infrastructure.
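
For readers unfamiliar with the metric behind the ASR numbers above, WER is conventionally the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `word_error_rate` and the example sentences are illustrative assumptions, and real evaluations usually also normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative placeholder sentences, not taken from the dataset:
# one substitution plus one deletion against a four-word reference.
example_wer = word_error_rate("a b c d", "a x c")
```

A reported 3.24% WER therefore corresponds to roughly three word-level errors per hundred reference words.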

Abstract

We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings across nine languages, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine languages; Swahili data was supplemented with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86–100% text approval rates for the six primary languages. To validate the dataset's utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), reducing prior academic SOTA from 8.3% to 3.24% (5.1 percentage points absolute, 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on Hugging Face. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.
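
The absolute and relative reductions quoted in the abstract follow directly from the two Swahili WER figures. The short sketch below reproduces that arithmetic; the helper name `wer_reduction` is an assumption introduced here for illustration, and only the values 8.3 and 3.24 come from the abstract.

```python
def wer_reduction(baseline_wer: float, new_wer: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction as a fraction)."""
    absolute = baseline_wer - new_wer
    relative = absolute / baseline_wer
    return absolute, relative


# WER values in percent, as quoted in the abstract for Swahili (Common Voice).
absolute, relative = wer_reduction(8.3, 3.24)
print(f"{absolute:.2f} percentage points absolute, {relative:.0%} relative")
```

The absolute difference is 5.06 points, which the abstract rounds to 5.1, and dividing it by the 8.3% baseline gives the 61% relative reduction.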
