Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

arXiv cs.LG / 4/27/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper proposes a zero-shot approach to discover morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering.
Using Giriama (nyf) as a test case with only 91 labeled paradigms, the method assigns noun classes for 2,455 words and uncovers two previously undocumented morphological patterns with high consistency.
External validation on 444 known Giriama verb paradigms yields 78.2% lemmatization accuracy, and a larger v3 corpus expansion (19,624 words) improves performance to 97.3% segmentation and 86.7% lemmatization across major word classes.
The authors argue that a weighted-voting ensemble works best because transfer learning captures cognates via substantial vocabulary overlap (~60%), while clustering identifies language-specific innovations that transfer may miss.
All code and the discovered lexicons are released to support morphological documentation efforts for other low-resource Bantu languages.

Abstract

We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.

Subagents: The Building Block of Agentic AI

Dev.to

DeepSeek-V4 Models Could Change Global AI Race

AI Business

Got OpenAI's privacy filter model running on-device via ExecuTorch

Reddit r/LocalLLaMA

The Agent-Skill Illusion: Why Prompt-Based Control Fails in Multi-Agent Business Consulting Systems

Dev.to

We Built a Voice AI Receptionist in 8 Weeks — Every Decision We Made and Why

Dev.to

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Key Points

Abstract

Related Articles

Subagents: The Building Block of Agentic AI

DeepSeek-V4 Models Could Change Global AI Race

Got OpenAI's privacy filter model running on-device via ExecuTorch

The Agent-Skill Illusion: Why Prompt-Based Control Fails in Multi-Agent Business Consulting Systems

We Built a Voice AI Receptionist in 8 Weeks — Every Decision We Made and Why

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer