Separate Before You Compress: The WWHO Tokenization Architecture
arXiv cs.CL / March 27, 2026
Key Points
- The paper argues that standard BPE tokenizers perform poorly on complex Abugida writing systems (e.g., Sinhala and Devanagari) by splitting multi-codepoint conjuncts into fragmented sub-character tokens that harm efficiency and increase inference cost.
- It introduces WWHO, a three-layer tokenization architecture that separates script-specific linguistic structure from statistical compression, aiming to improve multilingual tokenization without breaking valid syllables.
- The SGPE algorithm (Syllable-aware Grapheme Pair Encoding) is designed to be syllable-aware and provides a “Linguistic Zero-Breakage Guarantee,” ensuring no valid syllable is split across tokens.
- Experiments on a cleaned 30M-sentence training set and a 1.5M-sentence test set show substantial token reductions versus common baselines, including up to 61.7% fewer tokens for Sinhala, effectively extending the usable context window by as much as 4.38× for these languages.
- The results frame tokenization as a key contributor to the so-called “Token Tax” affecting the Global South, suggesting practical gains in model cost and effective context length for these scripts.
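To make the fragmentation problem concrete, the sketch below segments a Devanagari word into orthographic syllables (aksharas) by keeping virama-joined conjuncts and vowel signs attached to their base consonant. This is an illustrative simplification written for this summary, not the paper's SGPE algorithm; the codepoint ranges cover only core Devanagari consonants and vowel signs.

```python
# Hypothetical illustration of the paper's premise: a multi-codepoint
# Devanagari syllable is one linguistic unit, while naive per-codepoint
# or per-byte splitting fragments it. Simplified sketch, not SGPE.

VIRAMA = "\u094d"  # halant: joins a consonant to the following consonant

def is_consonant(ch: str) -> bool:
    return "\u0915" <= ch <= "\u0939"   # core Devanagari consonants

def is_matra(ch: str) -> bool:
    return "\u093e" <= ch <= "\u094c"   # dependent vowel signs

def syllables(text: str) -> list[str]:
    """Group codepoints into orthographic syllables (aksharas)."""
    out, cur = [], ""
    for ch in text:
        # Start a new syllable unless this codepoint extends the
        # current one (a matra, a virama, or a consonant after virama).
        if cur and not (cur.endswith(VIRAMA) or is_matra(ch) or ch == VIRAMA):
            out.append(cur)
            cur = ""
        cur += ch
    if cur:
        out.append(cur)
    return out

word = "नमस्ते"  # "namaste": 6 codepoints, 18 UTF-8 bytes
print(syllables(word))                        # ['न', 'म', 'स्ते'] — 3 syllables
print(len(word), len(word.encode("utf-8")))   # 6 18
```

The word spans 18 bytes, so a byte-level BPE with no learned merges for this script could emit many sub-character tokens, whereas a syllable-aware vocabulary needs only three tokens and never splits the conjunct स्ते, which is the kind of zero-breakage property the paper's guarantee formalizes.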