YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

arXiv cs.CL / 4/8/2026


Key Points

  • The paper introduces YoNER, a new multi-domain Yorùbá Named Entity Recognition dataset (about 5,000 sentences / 100,000 tokens) spanning Bible, Blogs, Movies, Radio broadcasts, and Wikipedia, annotated in CoNLL style with PER/ORG/LOC entity types.
  • Manual annotation by three native Yorùbá speakers achieved inter-annotator agreement above 0.70, ensuring high-quality, consistent labels across domains.
  • Cross-domain benchmarking with transformer encoder models (including comparisons against MasakhaNER 2.0) shows African-centric models generally outperform general multilingual ones, but performance drops sharply in certain domains like blogs and movies.
  • Domain-transfer experiments indicate that closely related formal domains (news and Wikipedia) transfer more effectively than informal ones, highlighting the domain sensitivity of Yorùbá NER.
  • The authors also release pretrained resources, including a Yorùbá-specific language model (OyoBERT) that outperforms multilingual models on in-domain evaluation, alongside the public release of YoNER.
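
To make the CoNLL-style annotation concrete: each token is paired with a BIO tag (B- starts an entity, I- continues it, O is outside), with entity types PER/ORG/LOC as in the dataset. The sketch below shows parsing such lines and extracting entity spans; the sample sentence and helper functions are invented for illustration and are not drawn from YoNER.

```python
def read_conll(text):
    """Parse CoNLL-style lines ("token<TAB>tag") into (token, tag) pairs.

    Blank lines normally separate sentences; here we parse a single one.
    """
    pairs = []
    for line in text.strip().splitlines():
        token, tag = line.split("\t")
        pairs.append((token, tag))
    return pairs


def extract_entities(pairs):
    """Collect (entity_text, entity_type) spans from BIO-tagged tokens."""
    entities, current, etype = [], [], None
    for token, tag in pairs:
        if tag.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)          # continue the open entity
        else:                              # O tag: close any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:                            # flush a trailing entity
        entities.append((" ".join(current), etype))
    return entities


# Invented example sentence, not from the dataset:
sample = "Adé\tB-PER\nlives\tO\nin\tO\nLagos\tB-LOC"
print(extract_entities(read_conll(sample)))  # → [('Adé', 'PER'), ('Lagos', 'LOC')]
```

The B-/I- distinction matters because it separates adjacent entities of the same type, which a plain per-token type label cannot.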

Abstract

Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multi-domain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains: Bible, Blogs, Movies, Radio broadcasts, and Wikipedia, annotated with three entity types, Person (PER), Organization (ORG), and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for the blog and movie domains. Furthermore, we observe that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models on in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.