How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

arXiv cs.CL / 4/21/2026

Key Points

  • The paper examines how the initial text tokenization step affects language models’ ability to encode phonological knowledge, given that tokenization never takes the sounds of words into account.
  • Probing experiments show that subword tokenization systematically degrades both local phonological features (like rhyme) and global ones (like syllabification) in text-only LMs.
  • The authors introduce the syllabification-tokenization alignment distance (STAD) to quantify how misaligned token boundaries are with natural syllable boundaries, finding that higher STAD correlates with weaker phonological representations.
  • To mitigate these issues, they propose a lightweight IPA-based fine-tuning approach that improves performance on three phonology-related tasks while largely preserving math and general reasoning, with only 1.1% and 0.9% drops on GSM8K and MMLU, respectively.

Abstract

Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively.
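The paper's exact formula for STAD is not reproduced here, but the idea of scoring how far token boundaries stray from syllable boundaries can be sketched with a simple set comparison. The sketch below is a hypothetical illustration, not the authors' definition: it treats each boundary as a character index inside the word and returns the normalized symmetric difference between the syllable-boundary set and the token-boundary set (0 means perfectly aligned, 1 means fully misaligned).

```python
def boundary_positions(segments):
    """Character indices where internal segment boundaries fall
    (word start and end are excluded)."""
    positions, idx = set(), 0
    for seg in segments[:-1]:
        idx += len(seg)
        positions.add(idx)
    return positions


def stad_sketch(syllables, tokens):
    """Hypothetical alignment distance between a word's syllabification
    and its tokenization: |symmetric difference| / |union| of the two
    boundary sets. NOT the paper's STAD formula, only an illustration."""
    s = boundary_positions(syllables)
    t = boundary_positions(tokens)
    if not (s | t):          # monosyllabic word kept as a single token
        return 0.0
    return len(s ^ t) / len(s | t)


# A tokenization that mirrors the syllable structure scores 0.0,
# while a subword split that ignores it scores higher:
print(stad_sketch(["pho", "nol", "o", "gy"], ["pho", "nol", "o", "gy"]))  # 0.0
print(stad_sketch(["pho", "nol", "o", "gy"], ["phon", "ology"]))          # 1.0
```

Under this toy definition, a higher score flags word/tokenizer pairs whose token boundaries cut across syllables, matching the paper's finding that greater misalignment goes with weaker phonological representations.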