Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon
arXiv cs.CL / 4/3/2026
💬 Opinion | Ideas & Deep Analysis | Models & Research
Key Points
- The paper tests computationally whether “non-conforming” basic vocabulary in six Sulawesi Austronesian languages reflects a pre-Austronesian substrate or independent innovation.
- It combines rule-based cognate subtraction with an XGBoost classifier over 26 phonological features, achieving AUC=0.763 for separating inherited from non-mainstream forms.
- The machine-learning model finds a phonological fingerprint for non-mainstream candidates, including longer word forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes.
- Cross-method agreement yields 266 high-confidence non-mainstream candidates (Cohen’s kappa=0.61), but clustering shows no coherent word families and no statistical support for a single shared substrate layer.
- Applying the approach to 16 additional languages indicates geographic patterning, with higher predicted non-mainstream rates in Sulawesi (mean P_sub=0.606) than in Western Indonesian languages (0.393), supporting regional mixture rather than one substrate language.
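
To make the cross-method agreement step concrete, here is a minimal sketch of how Cohen's kappa and a "both methods agree" candidate set could be computed from two binary label lists. The function name `cohens_kappa` and the toy labels are assumptions for illustration only. They are not the paper's data or code, and the summary does not say how the authors implemented this step.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items on which both methods give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies, summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both methods used a single identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (1 = flagged as non-mainstream, 0 = inherited).
rule_based = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]  # e.g. output of cognate subtraction
classifier = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]  # e.g. thresholded classifier output

kappa = cohens_kappa(rule_based, classifier)

# High-confidence candidates: items flagged by both methods.
high_confidence = [
    i for i, (a, b) in enumerate(zip(rule_based, classifier)) if a == 1 and b == 1
]

print(f"kappa = {kappa:.2f}")
print("high-confidence indices:", high_confidence)
```

Kappa corrects raw agreement for the agreement expected by chance. That is why it is reported alongside the size of the agreed-upon candidate set rather than a simple overlap percentage.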




