MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages
arXiv cs.CL / 3/24/2026
Key Points
- The paper introduces MzansiText, a curated multilingual pretraining corpus for South Africa’s eleven official written languages, along with a reproducible filtering pipeline.
- It also releases MzansiLM, a 125M-parameter decoder-only language model trained from scratch specifically for South African languages.
- Evaluations show that monolingual task-specific fine-tuning yields strong data-to-text generation, including 20.65 BLEU on isiXhosa, competitive with much larger encoder-decoder models.
- Multilingual task-specific fine-tuning improves topic classification for closely related languages, reaching 78.5% macro-F1 on isiXhosa news classification.
- The authors find that while the model adapts well to supervised NLU/NLG, few-shot reasoning remains difficult at this scale, motivating the released baseline and guidance for low-resource adaptation.
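The classification result above is reported as macro-F1, which averages per-class F1 scores so that rare topics count as much as frequent ones. As a reference for how that number is computed (a minimal stdlib sketch, not the paper's evaluation code), one way to implement it:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes equally to the average, a model that does well only on majority-topic news gets penalized, which is why macro-F1 is the usual choice for imbalanced low-resource classification benchmarks.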