Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

arXiv cs.LG / 4/27/2026


Key Points

  • The study tests whether a transformer trained only on modern Bantu morphological data can recover cross-lingual lexical structure that matches established historical reconstructions.
  • Using BantuMorph v7 on 14 Eastern and Southern Bantu languages, the authors extract lemma embeddings and identify 728 noun and 1,525 verb cognate candidates shared across at least five languages.
  • When evaluated against historical resources (the BLR3 Proto-Bantu reconstructions and the ASJP basic vocabulary), 10 of the top 11 noun candidates (90.9%) match reconstructed Proto-Bantu forms, and 12 verb cognates align with known Proto-Bantu roots.
  • Cross-model validation with NLLB-600M, an independent translation model, indicates that both models recover cognate clusters and phylogenetic groupings consistent with Guthrie-zone classifications (p < 0.01).
  • Cross-lingual noun class analysis shows that all 13 productive classes keep cosine similarity above 0.83 across languages, with within-class similarity exceeding between-class similarity (p < 10^-9), suggesting the model captures stable lexical/morphological structure shared across the Bantu languages studied.
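
The "shared across at least five languages" criterion can be pictured with a minimal sketch. It assumes candidate clusters have already been formed from the lemma embeddings and are stored as lists of (language, lemma) pairs. The function name `candidates_shared_across`, the cluster layout, and the placeholder data are hypothetical and are not taken from the paper.

```python
def candidates_shared_across(clusters, min_languages=5):
    """Keep clusters whose members span at least `min_languages` distinct languages.

    `clusters` maps a cluster id to a list of (language, lemma) pairs.
    This layout is an assumption made for illustration only.
    """
    kept = {}
    for cluster_id, members in clusters.items():
        languages = {language for language, _lemma in members}
        if len(languages) >= min_languages:
            kept[cluster_id] = sorted(languages)
    return kept


# Placeholder data: language labels and lemmas are invented, not real forms.
clusters = {
    "c1": [("lang_a", "w1"), ("lang_b", "w2"), ("lang_c", "w3"),
           ("lang_d", "w4"), ("lang_e", "w5")],
    "c2": [("lang_a", "w6"), ("lang_b", "w7")],
    "c3": [("lang_a", "w8"), ("lang_a", "w9"), ("lang_b", "w10"),
           ("lang_c", "w11"), ("lang_d", "w12")],
}
shared = candidates_shared_across(clusters)
```

Here only `c1` survives: `c3` has five members but only four distinct languages, so counting languages rather than members is what the threshold depends on.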

Abstract

We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources, namely the Bantu Lexical Reconstructions database (BLR3; 4,786 reconstructed Proto-Bantu forms) and the ASJP basic vocabulary, we confirm that 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain >0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.
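
The within-class versus between-class comparison in the noun class analysis can be illustrated with a short sketch. It assumes each lemma comes with an embedding vector, a noun class label, and a language label, and it averages cosine similarity over cross-language pairs only. The function name `cross_lingual_class_similarity` and the toy vectors are hypothetical. This is not the authors' pipeline and it does not reproduce their significance test.

```python
import numpy as np


def cross_lingual_class_similarity(embeddings, classes, languages):
    """Mean cosine similarity of same-class vs. different-class lemma pairs,
    restricted to pairs drawn from two different languages."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
    sims = X @ X.T                                    # pairwise cosine similarity

    cls = np.asarray(classes)
    lang = np.asarray(languages)
    same_class = cls[:, None] == cls[None, :]
    cross_lang = lang[:, None] != lang[None, :]
    upper = np.triu(np.ones(sims.shape, dtype=bool), k=1)  # count each pair once

    within = sims[same_class & cross_lang & upper].mean()
    between = sims[~same_class & cross_lang & upper].mean()
    return within, between


# Toy 2-d vectors, invented for illustration: two classes, two languages.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
classes = [1, 1, 2, 2]
languages = ["lang_a", "lang_b", "lang_a", "lang_b"]
within, between = cross_lingual_class_similarity(embeddings, classes, languages)
```

On this toy input `within` is far larger than `between`, which is the qualitative pattern the abstract reports. A real analysis would also need a significance test (for example a permutation test over class labels) to support a claim such as p < 10^-9.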