Vocabulary shapes cross-lingual variation of word-order learnability in language models

arXiv cs.AI / March 23, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study probes the cross-lingual learnability of word order by pretraining transformer language models on synthetic word-order variants of natural languages (a sketch of such perturbations follows this list).
  • Greater word-order irregularity consistently raises model surprisal, indicating reduced learnability.
  • Sentence reversal affects learnability only weakly, suggesting the models are insensitive to some word-order perturbations.
  • Vocabulary structure (the word and subword inventory) predicts surprisal better than a coarse free- vs. fixed-word-order classification, pointing to vocabulary as a key driver of cross-lingual learnability.

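As flagged in the key points, the perturbations fall into two families: stochastic shuffling, whose rate can be dialed up to make word order more irregular, and deterministic sentence reversal. The paper's exact variant definitions aren't reproduced here; the following is a minimal sketch under that reading, with illustrative function names and an assumed `irregularity` knob:

```python
import random

def reverse_sentence(tokens: list[str]) -> list[str]:
    """Deterministic perturbation: emit the sentence in reverse token order."""
    return tokens[::-1]

def shuffle_sentence(tokens: list[str], irregularity: float,
                     rng: random.Random) -> list[str]:
    """Stochastic perturbation: with probability `irregularity`, permute the
    sentence's tokens uniformly at random; otherwise keep the original order.
    Higher values yield a more irregular word order across the corpus."""
    if rng.random() < irregularity:
        tokens = tokens[:]  # copy so the caller's list is untouched
        rng.shuffle(tokens)
    return tokens

# Toy usage: build corpus variants along an irregularity spectrum.
rng = random.Random(0)
sentence = "the cat sat on the mat".split()
for p in (0.0, 0.5, 1.0):
    print(f"p={p}:", " ".join(shuffle_sentence(sentence, p, rng)))
print("reversed:", " ".join(reverse_sentence(sentence)))
```
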
Abstract

Why do some languages, like Czech, permit free word order while others, like English, do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction between free-word-order languages (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain the cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
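Surprisal is the negative log-probability a model assigns to each token given its prefix, so a higher mean surprisal means the corpus was harder to learn. A minimal sketch of measuring it with a causal language model, assuming the Hugging Face transformers API and using `gpt2` purely as a stand-in for models that, in the paper, are pretrained on each word-order variant:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: "gpt2" substitutes for the paper's models, which are pretrained
# from scratch on each word-order variant rather than downloaded.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_surprisal(text: str) -> float:
    """Mean per-token surprisal, -log2 p(token | prefix), in bits."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model shifts targets internally and
        # returns the mean cross-entropy over predicted tokens, in nats.
        loss = model(ids, labels=ids).loss
    return loss.item() / math.log(2)  # nats -> bits

print(mean_surprisal("The cat sat on the mat."))
```

Comparing this quantity across models trained on different word-order variants of the same corpus is what lets the study rank variants by learnability.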
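The abstract does not say which vocabulary statistics carry the predictive signal. Purely as an illustration of the kind of predictor involved, two commonly used vocabulary-structure measures are the word type-token ratio and subword fertility (mean subwords per word):

```python
def vocab_stats(corpus: list[str], subword_tokenize) -> dict[str, float]:
    """Illustrative vocabulary-structure statistics (assumed, not the paper's
    exact predictors): word type-token ratio and subword fertility."""
    words = [w for line in corpus for w in line.split()]
    n_subwords = sum(len(subword_tokenize(w)) for w in words)
    return {
        "type_token_ratio": len(set(words)) / len(words),
        "subword_fertility": n_subwords / len(words),  # subwords per word
    }

# Hypothetical usage with a trivial character-bigram "tokenizer"; in practice
# this would be the subword tokenizer used for pretraining.
toy_tokenize = lambda w: [w[i:i + 2] for i in range(0, len(w), 2)]
print(vocab_stats(["the cat sat on the mat", "cats sit on mats"], toy_tokenize))
```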