Separate Before You Compress: The WWHO Tokenization Architecture

arXiv cs.CL · March 27, 2026


Key Points

  • The paper argues that standard BPE tokenizers perform poorly on complex Abugida writing systems (e.g., Sinhala and Devanagari) by splitting multi-codepoint conjuncts into fragmented sub-character tokens that harm efficiency and increase inference cost.
  • It introduces WWHO, a three-layer tokenization architecture that separates script-specific linguistic structure from statistical compression, aiming to improve multilingual tokenization without breaking valid syllables.
  • The SGPE algorithm (Syllable-aware Grapheme Pair Encoding) is designed to be syllable-aware and provides a “Linguistic Zero-Breakage Guarantee,” ensuring no valid syllable is split across tokens.
  • Experiments on a cleaned 30M-sentence training set and a 1.5M-sentence test set show substantial token reductions versus common baselines, including up to 61.7% fewer tokens for Sinhala and an effective context-window extension of up to 4.38× for these languages.
  • The results frame tokenization as a key contributor to the so-called “Token Tax” affecting the Global South, suggesting practical gains in model cost and effective context length for these scripts.
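The fragmentation problem described above can be illustrated with a short sketch (not taken from the paper). A byte-level BPE tokenizer's atomic units are UTF-8 bytes, and every Devanagari codepoint costs three bytes, so before any merges are learned a short Hindi word already decomposes into far more base units than an English word of similar length:

```python
# Illustrative sketch: why byte-level BPE disadvantages Abugida scripts
# before any merges happen. The helper name `utf8_atoms` is ours, not
# the paper's; it counts the base units a byte-level tokenizer starts from.

def utf8_atoms(text: str) -> int:
    """Number of atomic units a byte-level BPE begins with."""
    return len(text.encode("utf-8"))

english = "hello"   # 5 ASCII letters -> 5 bytes
hindi = "नमस्ते"      # 6 codepoints, including the conjunct स्ते

print(utf8_atoms(english))  # 5
print(utf8_atoms(hindi))    # 18 — 3 bytes per Devanagari codepoint
```

Merges learned from skewed training data rarely recover whole conjuncts from these byte fragments, which is the fragmentation the paper targets.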

Abstract

Current Large Language Models (LLMs) mostly rely on BPE (Byte Pair Encoding) tokenizers, which are highly effective for simply structured Latin-script languages such as English. Standard BPE tokenizers, however, struggle with complex Abugida scripts: they break conjuncts, which are multi-codepoint grapheme clusters, into meaningless sub-character units. This degrades the LLM's reasoning efficiency by forcing it to learn basic orthographic structure at inference time, and it raises inference costs, imposing a significant "Token Tax" on the Global South. We propose WWHO (Where-What-How Often), a three-layer architecture, together with SGPE (Syllable-aware Grapheme Pair Encoding), an algorithm that separates the script's linguistic rules from the statistical compression process while enabling seamless multilingual tokenization. Using Sinhala and Devanagari (Hindi/Sanskrit) as highly complex Abugida scripts, we trained WWHO on a cleaned 30-million-sentence dataset and evaluated it on a 1,499,950-sentence test set. For Sinhala, SGPE achieves a Token-to-Word Ratio (TWR) of 1.274 with 4.83 characters per token, a 61.7 percent reduction in tokens compared to OpenAI's o200k base. For Hindi, it achieves a TWR of 1.181 (a 27.0 percent reduction vs. o200k). On the mixed-script (Sinhala, Devanagari, and English) dataset, SGPE achieves an overall TWR of 1.240, with token reductions of 36.7 percent, 39.6 percent, and 60.2 percent relative to o200k base, Llama 4 Scout, and DeepSeek V3, respectively. This effectively extends the usable context window by up to 4.38 times for these Abugida languages while providing a Linguistic Zero-Breakage Guarantee: no valid syllable is ever split across multiple tokens.
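The "separate before you compress" idea can be sketched in miniature. The paper's SGPE algorithm is not reproduced here; the simplified rule-based splitter below (our own illustration, with assumed rules covering only core Devanagari) shows the first layer's job: group codepoints into orthographic syllables so that a later BPE-style merge stage can only ever combine whole syllables, never split one.

```python
# Illustrative sketch of the linguistic layer: a simplified Devanagari
# syllable (akshara) splitter. Rules here are a deliberate simplification,
# not the paper's SGPE: combining marks (matras, nasalization signs) stay
# with their base, and a virama binds the next consonant into one conjunct.
import unicodedata

VIRAMA = "\u094d"  # Devanagari halant, glues consonants into conjuncts

def syllables(text: str) -> list[str]:
    out: list[str] = []
    for ch in text:
        joins_previous = bool(out) and (
            unicodedata.category(ch).startswith("M")  # matra or sign
            or out[-1].endswith(VIRAMA)               # conjunct continues
        )
        if joins_previous:
            out[-1] += ch
        else:
            out.append(ch)
    return out

print(syllables("नमस्ते"))  # ['न', 'म', 'स्ते'] — the conjunct स्ते stays whole
```

Running statistical merges over these units instead of raw bytes is what makes a zero-breakage guarantee possible by construction: the compressor never sees anything smaller than a valid syllable.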
