Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
arXiv cs.LG / 4/27/2026
Key Points
- The study tests whether a transformer trained only on modern Bantu morphological data can recover cross-lingual lexical structure that matches established historical reconstructions.
- Using BantuMorph v7 on 14 Eastern and Southern Bantu languages, the authors extract lemma embeddings and identify 728 noun and 1,525 verb cognate candidates shared across at least five languages.
- Evaluated against historical resources (BLR3 Proto-Bantu reconstructions and ASJP), 10 of the top 11 noun candidates match reconstructed Proto-Bantu forms with high accuracy, and 12 verb cognates also align with known Proto-Bantu roots.
- A cross-model check with NLLB-600M indicates that both models recover cognate clusters and phylogenetic groupings consistent with Guthrie-zone classifications; the authors report the result as statistically significant.
- Cross-lingual noun class analysis shows strong within-class embedding similarity across languages for all productive classes, suggesting the model captures stable lexical/morphological structure shared across the Bantu languages studied.
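
The summary does not say how cognate candidates are selected from the lemma embeddings. As a rough illustration of the idea in the key points (lemmas grouped across languages, kept only when a group spans at least five languages), here is a minimal sketch. It assumes cosine similarity with a fixed threshold and connected-component grouping, which may differ from the authors' procedure. The function names, the threshold, and the 3-dimensional vectors are all hypothetical; the example words are common Bantu words for "person" and "book", chosen for illustration and not taken from the paper's data.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def candidate_clusters(langs, lemmas, vectors, threshold=0.9, min_languages=5):
    """Link lemmas from different languages whose similarity is at least
    `threshold`, then keep linked groups spanning >= `min_languages` languages.
    (Assumed procedure for illustration, not the paper's method.)"""
    sims = cosine_matrix(np.asarray(vectors, dtype=float))
    n = len(lemmas)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if langs[i] != langs[j] and sims[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    return [
        sorted((langs[i], lemmas[i]) for i in members)
        for members in groups.values()
        if len({langs[i] for i in members}) >= min_languages
    ]

# Toy input: invented vectors, not real model embeddings.
langs = ["sw", "zu", "sn", "xh", "tn", "sw", "zu"]
lemmas = ["mtu", "umuntu", "munhu", "umntu", "motho", "kitabu", "incwadi"]
vectors = [
    [1.00, 0.05, 0.00],
    [0.98, 0.00, 0.05],
    [0.97, 0.06, 0.02],
    [0.99, 0.02, 0.04],
    [0.95, 0.10, 0.00],
    [0.00, 1.00, 0.10],
    [0.05, 0.10, 1.00],
]

clusters = candidate_clusters(langs, lemmas, vectors)
for cluster in clusters:
    print(cluster)
```

With this toy input, the five "person" words form the only group that clears the five-language cutoff; the two "book" words are neither similar to each other nor spread over enough languages. In practice the threshold and the language cutoff would need tuning against a reference such as BLR3.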