Multilingual Language Models Encode Script Over Linguistic Structure
arXiv cs.LG / 4/8/2026
Key Points
- The paper analyzes how multilingual language models form internal representations, testing whether they are organized more by abstract language identity/typology or by surface-form cues such as orthography.
- Using the Language Activation Probability Entropy (LAPE) metric and sparse autoencoders (SAEs) on the compact, distilled Llama-3.2-1B and Gemma-2-2B models, the authors find that orthography dominates the representation structure (a minimal sketch of LAPE follows this list).
- Romanization yields near-disjoint internal representations that align well with neither the corresponding native-script inputs nor English, indicating strong sensitivity to surface-form changes (see the set-overlap sketch after this list).
- Word-order shuffling has limited impact on which internal “language-associated units” are activated, suggesting typological order is not the primary driver of unit identity.
- The study finds that typological information becomes more accessible in deeper layers. Causal interventions further show that generation depends more on units that remain invariant under surface-form perturbations than on units selected purely by typological alignment (an ablation-style intervention is sketched below).
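
As context for the LAPE-based finding, here is a minimal sketch of the metric as it is commonly defined: each unit's per-language activation probabilities are normalized into a distribution over languages and scored by Shannon entropy, so low-entropy units are ones that fire on only a few languages. The function name, array shapes, and the quantile cutoff in the closing comment are illustrative assumptions, not necessarily the paper's exact setup.

```python
import numpy as np

def lape(act_prob: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Language Activation Probability Entropy per unit.

    act_prob: array of shape (num_units, num_languages); entry (i, l) is
    the fraction of language-l tokens on which unit i fires (activation
    above some threshold, e.g. > 0). A low entropy score marks a unit
    whose firing concentrates on few languages, i.e. a candidate
    "language-associated unit".
    """
    # Normalize each unit's per-language activation probabilities into a
    # distribution over languages.
    p = act_prob / (act_prob.sum(axis=1, keepdims=True) + eps)
    # Shannon entropy over languages, computed per unit.
    return -(p * np.log(p + eps)).sum(axis=1)

# Units in the lowest-entropy quantile (e.g. the bottom few percent) would
# then be kept as language-associated units for downstream analyses.
```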
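The "near-disjoint" and "limited impact" claims both come down to comparing the sets of units selected under different input conditions. A plain Jaccard overlap, sketched below with hypothetical placeholder indices, makes the contrast concrete: romanized inputs select almost entirely different units than native script, while word-order-shuffled inputs largely reuse the same ones. The specific overlap statistic and indices here are illustrative assumptions.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Set overlap in [0, 1]; near-disjoint unit sets score close to 0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical unit-index sets selected (e.g. by low LAPE) under each
# input condition; all indices are placeholders.
native_units = {12, 87, 301, 512}
romanized_units = {44, 95, 610, 733}
shuffled_units = {12, 87, 301, 610}

print(jaccard(native_units, romanized_units))  # ~0: script-driven split
print(jaccard(native_units, shuffled_units))   # high: order-insensitive
```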
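The causal interventions in the last point amount to silencing chosen units and observing how generation changes. Below is a generic PyTorch forward-hook sketch of such an ablation; the layer path, unit indices, and comparison procedure are assumptions for illustration, not the paper's exact protocol.

```python
import torch

def make_ablation_hook(unit_ids: list[int]):
    """Build a forward hook that zeroes the given hidden units.

    A generic sketch of a unit-ablation intervention; which layers and
    units to target, and how effects are scored, will vary by study.
    """
    def hook(module, inputs, output: torch.Tensor) -> torch.Tensor:
        output = output.clone()       # leave the original tensor intact
        output[..., unit_ids] = 0.0   # silence the selected units
        return output                 # returned tensor replaces the output
    return hook

# Hypothetical usage on one MLP layer of a Hugging Face Llama-style model:
# handle = model.model.layers[10].mlp.register_forward_hook(
#     make_ablation_hook([12, 87, 301]))
# ... generate with and without the hook, compare outputs ...
# handle.remove()
```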