Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

arXiv cs.LG / 4/27/2026

Key Points

  • The paper proposes a zero-shot approach to discover morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering.
  • Using Giriama (nyf) as a test case with only 91 labeled paradigms, the method assigns noun classes to 2,455 words and uncovers two previously undocumented morphological patterns: an a- prefix variant for Class 2 (95.1% consistency) and a contracted k'- prefix (98.5% consistency).
  • External validation on 444 known Giriama verb paradigms yields 78.2% lemmatization accuracy, and a v3 corpus expansion to 19,624 words (9,014 unique lemmas) reaches 97.3% segmentation and 86.7% lemmatization rates across all major word classes.
  • The ensemble combines transfer learning from Swahili with unsupervised clustering via weighted voting, and the authors argue the two are complementary: transfer excels at detecting cognates thanks to substantial vocabulary overlap (~60%), while clustering identifies language-specific innovations that transfer misses.
  • All code and the discovered lexicons are released to support morphological documentation efforts for other low-resource Bantu languages.

Abstract

We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence, i.e. the merger of two adjacent vowels, of wa-; 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
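
The abstract does not specify how the weighted vote is computed, so the following is only a minimal sketch of one common formulation: each component proposes a noun-class label with a confidence, and the label with the highest weight-times-confidence total wins. The function name `weighted_vote`, the component names, the weights, and the class labels are all hypothetical placeholders, not values or identifiers from the paper or its released code.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-component label predictions for a single word.

    predictions: dict mapping component name -> (label, confidence in [0, 1])
    weights:     dict mapping component name -> non-negative weight
    Returns (winning_label, share_of_total_score), or (None, 0.0) if no
    component contributes any score.

    Assumption: scores are weight * confidence, summed per label. The
    paper may use a different scheme.
    """
    scores = defaultdict(float)
    for component, (label, confidence) in predictions.items():
        scores[label] += weights[component] * confidence

    total = sum(scores.values())
    if total == 0:
        return None, 0.0

    # Ties resolve to the label that was scored first.
    best = max(scores, key=scores.get)
    return best, scores[best] / total


if __name__ == "__main__":
    # Hypothetical weights and predictions, for illustration only.
    weights = {"transfer": 0.6, "clustering": 0.4}
    predictions = {
        "transfer": ("class_2", 0.9),    # e.g. a cognate-based guess
        "clustering": ("class_6", 0.6),  # e.g. a cluster-based guess
    }
    print(weighted_vote(predictions, weights))
```

Under this formulation a confident clustering prediction can still override a low-confidence transfer prediction, which is one way an ensemble could surface language-specific patterns that have no Swahili counterpart.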