MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages

arXiv cs.CL · March 24, 2026


Key Points

  • The paper introduces MzansiText, a curated multilingual pretraining corpus for South Africa’s eleven official written languages, along with a reproducible filtering pipeline.
  • It also releases MzansiLM, a 125M-parameter decoder-only language model trained from scratch specifically for South African languages.
  • Evaluations show that monolingual task-specific fine-tuning enables strong data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger.
  • Multilingual task-specific fine-tuning improves closely related languages on topic classification, reaching 78.5% macro-F1 on isiXhosa news classification.
  • The authors find that while the model adapts well to supervised NLU/NLG, few-shot reasoning remains difficult at this scale, motivating the released baseline and guidance for low-resource adaptation.
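The macro-F1 figure cited above averages per-class F1 scores with equal weight, so rare news topics count as much as common ones. A minimal, dependency-free sketch of the metric (the example labels are invented for illustration, not drawn from the paper's benchmark):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with equal class weight."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example: three topic classes, one misclassification.
truth = ["sports", "politics", "sports", "business"]
preds = ["sports", "sports", "sports", "business"]
print(macro_f1(truth, preds))  # → 0.6
```

Because the missed "politics" class contributes an F1 of zero, the macro average drops sharply, whereas a micro (accuracy-style) average would report 0.75 here.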

Abstract

Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.
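The abstract notes that MzansiText is built with a reproducible filtering pipeline but does not spell out its rules here. As a hedged illustration only, a typical document-level filter for web-scraped low-resource corpora combines length and character-ratio heuristics with exact deduplication; the function name, thresholds, and heuristics below are assumptions for the sketch, not the paper's actual pipeline:

```python
import hashlib
import re

def keep_document(text, seen_hashes, min_chars=200, max_symbol_ratio=0.3):
    """Illustrative quality filter: length check, symbol-ratio check, exact dedup.

    Thresholds are placeholders, not values from the MzansiText pipeline.
    """
    text = text.strip()
    if len(text) < min_chars:
        return False  # drop very short fragments
    # High ratio of non-alphabetic, non-space characters often signals markup/boilerplate
    symbols = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    if symbols / len(text) > max_symbol_ratio:
        return False
    # Exact dedup via a hash of whitespace-normalized, lowercased content
    digest = hashlib.sha256(re.sub(r"\s+", " ", text.lower()).encode()).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

seen = set()
doc = "Umbhalo wesiXhosa olungileyo. " * 12  # a repeated well-formed sentence
print(keep_document(doc, seen))   # first occurrence passes
print(keep_document(doc, seen))   # exact duplicate is rejected
```

Running a filter like this per language, with thresholds tuned on held-out samples, is one common way to make such a pipeline reproducible end to end.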