Vocabulary shapes cross-lingual variation of word-order learnability in language models

arXiv cs.AI / March 23, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study probes the cross-lingual learnability of word order by pretraining transformer language models on synthetic word-order variants of natural languages (a sketch of such perturbations follows this list).
  • Greater word-order irregularity consistently raises model surprisal, indicating reduced learnability.
  • Sentence reversal affects learnability only weakly, suggesting the models are insensitive to some word-order perturbations.
  • Vocabulary structure (the word and subword inventory) predicts surprisal better than a coarse free- vs. fixed-word-order classification, pointing to vocabulary as a key driver of cross-lingual learnability.

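As flagged in the key points, the perturbations fall into two families: stochastic shuffling, whose rate can be dialed up to make word order more irregular, and deterministic sentence reversal. The paper's exact variant definitions aren't reproduced here; the following is a minimal sketch under that reading, with illustrative function names and an assumed `irregularity` knob:

```python
import random

def reverse_sentence(tokens: list[str]) -> list[str]:
    """Deterministic perturbation: emit the sentence in reverse token order."""
    return tokens[::-1]

def shuffle_sentence(tokens: list[str], irregularity: float,
                     rng: random.Random) -> list[str]:
    """Stochastic perturbation: with probability `irregularity`, permute the
    sentence's tokens uniformly at random; otherwise keep the original order.
    Higher values yield a more irregular word order across the corpus."""
    if rng.random() < irregularity:
        tokens = tokens[:]  # copy so the caller's list is untouched
        rng.shuffle(tokens)
    return tokens

# Toy usage: build corpus variants along an irregularity spectrum.
rng = random.Random(0)
sentence = "the cat sat on the mat".split()
for p in (0.0, 0.5, 1.0):
    print(f"p={p}:", " ".join(shuffle_sentence(sentence, p, rng)))
print("reversed:", " ".join(reverse_sentence(sentence)))
```
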
Abstract

Why do some languages, like Czech, permit free word order while others, like English, do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction between free-word-order languages (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain the cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
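Surprisal is the negative log-probability a model assigns to each token given its prefix, so a higher mean surprisal means the corpus was harder to learn. A minimal sketch of measuring it with a causal language model, assuming the Hugging Face transformers API and using `gpt2` purely as a stand-in for models that, in the paper, are pretrained on each word-order variant:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: "gpt2" substitutes for the paper's models, which are pretrained
# from scratch on each word-order variant rather than downloaded.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_surprisal(text: str) -> float:
    """Mean per-token surprisal, -log2 p(token | prefix), in bits."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model shifts targets internally and
        # returns the mean cross-entropy over predicted tokens, in nats.
        loss = model(ids, labels=ids).loss
    return loss.item() / math.log(2)  # nats -> bits

print(mean_surprisal("The cat sat on the mat."))
```

Comparing this quantity across models trained on different word-order variants of the same corpus is what lets the study rank variants by learnability.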
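The abstract does not say which vocabulary statistics carry the predictive signal. Purely as an illustration of the kind of predictor involved, two commonly used vocabulary-structure measures are the word type-token ratio and subword fertility (mean subwords per word):

```python
def vocab_stats(corpus: list[str], subword_tokenize) -> dict[str, float]:
    """Illustrative vocabulary-structure statistics (assumed, not the paper's
    exact predictors): word type-token ratio and subword fertility."""
    words = [w for line in corpus for w in line.split()]
    n_subwords = sum(len(subword_tokenize(w)) for w in words)
    return {
        "type_token_ratio": len(set(words)) / len(words),
        "subword_fertility": n_subwords / len(words),  # subwords per word
    }

# Hypothetical usage with a trivial character-bigram "tokenizer"; in practice
# this would be the subword tokenizer used for pretraining.
toy_tokenize = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]
print(vocab_stats(["the cat sat on the mat", "cats sit on mats"], toy_tokenize))
```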