Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon

arXiv cs.CL / 4/3/2026


Key Points

  • The paper tests computationally whether “non-conforming” basic vocabulary in six Sulawesi Austronesian languages reflects a pre-Austronesian substrate or independent innovation.
  • It combines rule-based cognate subtraction with an XGBoost classifier over 26 phonological features, achieving AUC=0.763 for separating inherited from non-mainstream forms.
  • The machine-learning model finds a phonological fingerprint for non-mainstream candidates, including longer word forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes.
  • Cross-method agreement yields 266 high-confidence non-mainstream candidates (Cohen’s kappa=0.61), but clustering shows no coherent word families and no statistical support for a single shared substrate layer.
  • Applying the approach to 16 additional languages indicates geographic patterning, with higher predicted non-mainstream rates in Sulawesi (mean P_sub=0.606) than in Western Indonesian languages (0.393), supporting regional mixture rather than one substrate language.
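
To make the phonological fingerprint described above more concrete, here is a minimal sketch of how four of the named cues (form length, consonant clusters, glottal stop rate, Austronesian-style prefixes) might be turned into numeric features. This is not the paper's feature set: the function name `phonological_features`, the glottal symbols, the prefix list, and the example strings are all assumptions chosen for illustration, and the paper's 26 features are not enumerated in the summary.

```python
# Illustrative sketch only; symbol sets and prefix list are assumptions,
# not the feature definitions used in the paper.
VOWELS = set("aeiou")
GLOTTALS = set("ʔ'")                 # assumed orthographic glottal stop symbols
PREFIXES = ("ma", "pa", "ka", "ta")  # hypothetical Austronesian-style prefixes


def is_consonant(ch: str) -> bool:
    """Treat glottal symbols and any non-vowel letter as a consonant."""
    return ch in GLOTTALS or (ch.isalpha() and ch not in VOWELS)


def phonological_features(form: str) -> dict:
    """Compute a few simple phonological features for one word form."""
    s = form.lower().strip()
    length = len(s)

    # Count runs of two or more adjacent consonants. Digraphs such as "ng"
    # are counted as clusters here, which is a known simplification.
    clusters = 0
    run = 0
    for ch in s:
        if is_consonant(ch):
            run += 1
            if run == 2:
                clusters += 1
        else:
            run = 0

    glottals = sum(ch in GLOTTALS for ch in s)
    return {
        "length": length,
        "consonant_clusters": clusters,
        "glottal_rate": glottals / length if length else 0.0,
        # Require some material after the prefix so short roots do not match.
        "has_prefix": any(
            s.startswith(p) and len(s) > len(p) + 1 for p in PREFIXES
        ),
    }


if __name__ == "__main__":
    # Toy strings for demonstration, not forms taken from the database.
    for word in ("mataʔ", "lambru"):
        print(word, phonological_features(word))
```

Feature dictionaries like these could be stacked into a matrix and passed to a gradient-boosted classifier; the choice of hand-built, interpretable features is what lets a model's output be read as a "fingerprint" rather than an opaque score.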

Abstract

Basic vocabulary in many Sulawesi Austronesian languages includes forms that resist reconstruction to any proto-form and show phonological patterns inconsistent with inherited roots. Whether this non-conforming vocabulary represents a pre-Austronesian substrate or independent innovation has not been tested computationally. We combine rule-based cognate subtraction with a machine learning classifier trained on phonological features. Using 1,357 forms from six Sulawesi languages in the Austronesian Basic Vocabulary Database, we identify 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763, revealing a phonological fingerprint: longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes. Cross-method consensus (Cohen's kappa=0.61) identifies 266 high-confidence non-mainstream candidates. However, clustering yields no coherent word families (silhouette=0.114; cross-linguistic cognate test p=0.569), providing no evidence for a single pre-Austronesian language layer. Application to 16 additional languages confirms geographic patterning: Sulawesi languages show higher predicted non-mainstream rates (mean P_sub=0.606) than Western Indonesian languages (0.393). This study demonstrates that phonological machine learning can complement traditional comparative methods in detecting non-mainstream lexical layers, while cautioning against interpreting phonological non-conformity as evidence for a shared substrate language.
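
The cross-method consensus reported in the abstract is measured with Cohen's kappa, which corrects raw agreement between two labelers for the agreement expected by chance. The sketch below shows the standard formula, kappa = (p_o - p_e) / (1 - p_e), applied to two label lists. The function name `cohens_kappa` and the toy labels are assumptions for illustration; they are not the paper's data or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)

    # Observed agreement: share of items on which both methods agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each method's marginal rate per label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if expected == 1.0:
        # Both methods always emit the same single label; kappa is undefined,
        # so perfect agreement is returned by convention in this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy labels: 1 = flagged as non-mainstream, 0 = treated as inherited.
    rule_based = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
    classifier = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
    # Agreement is 8/10 and chance agreement is 0.5, so kappa is about 0.6.
    print(round(cohens_kappa(rule_based, classifier), 3))
```

In this toy case the two methods agree on 80% of items, but because half of that agreement would be expected by chance, kappa comes out near 0.6, close to the moderate-to-substantial level the abstract reports.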