How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

arXiv cs.CL / 4/21/2026

Key Points

  • The paper examines how the initial text tokenization step affects language models’ ability to encode phonological knowledge, given that tokenization never takes the sounds of words into account.
  • Probing experiments show that subword tokenization systematically degrades both local phonological features (like rhyme) and global ones (like syllabification) in text-only LMs.
  • The authors introduce the syllabification-tokenization alignment distance (STAD) to quantify how misaligned token boundaries are with natural syllable boundaries, finding that higher STAD correlates with weaker phonological representations.
  • To mitigate these issues, they propose a lightweight IPA-based fine-tuning approach that improves performance on three phonology-related tasks while largely preserving math and general reasoning, with only 1.1% and 0.9% drops on GSM8K and MMLU, respectively.

Abstract

Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively.
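The paper's exact formula for STAD is not reproduced here, but the idea of scoring how far token boundaries stray from syllable boundaries can be sketched with a simple set comparison. The sketch below is a hypothetical illustration, not the authors' definition: it treats each boundary as a character index inside the word and returns the normalized symmetric difference between the syllable-boundary set and the token-boundary set (0 means perfectly aligned, 1 means fully misaligned).

```python
def boundary_positions(segments):
    """Character indices where internal segment boundaries fall
    (word start and end are excluded)."""
    positions, idx = set(), 0
    for seg in segments[:-1]:
        idx += len(seg)
        positions.add(idx)
    return positions


def stad_sketch(syllables, tokens):
    """Hypothetical alignment distance between a word's syllabification
    and its tokenization: |symmetric difference| / |union| of the two
    boundary sets. NOT the paper's STAD formula, only an illustration."""
    s = boundary_positions(syllables)
    t = boundary_positions(tokens)
    if not (s | t):          # monosyllabic word kept as a single token
        return 0.0
    return len(s ^ t) / len(s | t)


# A tokenization that mirrors the syllable structure scores 0.0,
# while a subword split that ignores it scores higher:
print(stad_sketch(["pho", "nol", "o", "gy"], ["pho", "nol", "o", "gy"]))  # 0.0
print(stad_sketch(["pho", "nol", "o", "gy"], ["phon", "ology"]))          # 1.0
```

Under this toy definition, a higher score flags word/tokenizer pairs whose token boundaries cut across syllables, matching the paper's finding that greater misalignment goes with weaker phonological representations.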