Tokens with Meaning: A Hybrid Tokenization Approach for Turkish
arXiv cs.CL / 4/1/2026
Key Points
- The paper argues that standard frequency-based subword tokenizers such as BPE and WordPiece often produce poor segmentations for morphologically rich, agglutinative languages like Turkish, obscuring morpheme boundaries.
- It proposes a linguistically informed hybrid Turkish tokenizer that combines dictionary-based morphological segmentation, phonological normalization that maps surface allomorphs onto canonical affixes, and a controlled subword fallback for out-of-vocabulary material (see the first sketch after this list).
- The released vocabulary includes 22,231 root tokens mapped to 20,000 canonical root identifiers, 72 affix identifiers covering 177 allomorphic forms, 12,696 subword units, and an orthographic case token to preserve capitalization.
- On TR-MMLU, the tokenizer achieves 90.29% Turkish Token Percentage and 85.80% Pure Token Percentage, outperforming several general-purpose tokenizers on these linguistic alignment metrics (see the second sketch after this list).
- Downstream evaluations with random-initialization controls show the tokenizer improves sentence embedding and linguistic benchmark performance, including strong results on Turkish STS, MTEB-TR, and TurBLiMP.
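The three-stage pipeline lends itself to a compact illustration. Below is a minimal Python sketch under stated assumptions: `ROOTS`, `AFFIXES`, `tokenize_word`, and every entry in them are hypothetical toy stand-ins, not the paper's released vocabulary of 22,231 roots, 72 affix identifiers, and 12,696 subword units.

```python
# Toy root dictionary: surface root -> canonical root identifier.
ROOTS = {"ev": "ROOT_ev", "göz": "ROOT_göz", "kitap": "ROOT_kitap"}

# Toy affix table: one canonical identifier per affix, listing its surface
# allomorphs so vowel-harmony variants collapse onto a single ID.
AFFIXES = {
    "AFF_plural": ["ler", "lar"],               # -lAr
    "AFF_locative": ["de", "da", "te", "ta"],   # -DA
}
ALLOMORPH_TO_ID = {form: aff for aff, forms in AFFIXES.items() for form in forms}

def subword_fallback(s: str) -> list[str]:
    """Controlled OOV fallback: character pieces stand in here for the
    paper's 12,696-unit subword inventory."""
    return [f"<sub:{ch}>" for ch in s]

def tokenize_word(word: str) -> list[str]:
    tokens = []
    lower = word.lower()
    # An orthographic case token preserves capitalization losslessly.
    if word[:1].isupper():
        tokens.append("<cap>")
    # Stage 1: dictionary-based morphological segmentation (longest root prefix).
    for i in range(len(lower), 0, -1):
        if lower[:i] in ROOTS:
            tokens.append(ROOTS[lower[:i]])
            rest = lower[i:]
            break
    else:
        return tokens + subword_fallback(lower)  # Stage 3: no known root
    # Stage 2: phonological normalization, mapping each surface allomorph
    # onto its canonical affix identifier (longest match first).
    while rest:
        for j in range(len(rest), 0, -1):
            if rest[:j] in ALLOMORPH_TO_ID:
                tokens.append(ALLOMORPH_TO_ID[rest[:j]])
                rest = rest[j:]
                break
        else:
            tokens.extend(subword_fallback(rest))  # residue: Stage 3 again
            break
    return tokens

# Different surface allomorphs (-ler/-de vs. -lar/-da) yield the same IDs:
print(tokenize_word("Evlerde"))     # ['<cap>', 'ROOT_ev', 'AFF_plural', 'AFF_locative']
print(tokenize_word("kitaplarda"))  # ['ROOT_kitap', 'AFF_plural', 'AFF_locative']
```

This collapsing step is what lets 72 affix identifiers cover 177 allomorphic surface forms: phonologically conditioned variants of one morpheme all map to a single ID.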
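The two alignment metrics can likewise be sketched. Assuming, as one plausible reading of the terminology rather than the paper's exact definition, that Turkish Token Percentage is the share of emitted tokens a Turkish validator accepts and Pure Token Percentage the share that are whole meaningful units, a toy computation might look like this; `is_valid_turkish` and `is_pure` are illustrative stand-ins, not the paper's scoring code, and the token stream reuses the conventions of the previous sketch.

```python
# Hypothetical token stream: canonical roots/affixes, a case marker, and one
# subword-fallback fragment.
tokens = ["<cap>", "ROOT_ev", "AFF_plural", "AFF_locative",
          "ROOT_kitap", "AFF_plural", "AFF_locative", "<sub:q>"]

def is_valid_turkish(token: str) -> bool:
    # Hypothetical validator: a real evaluation would consult a morphological
    # analyzer; here only subword-fallback fragments fail.
    return not token.startswith("<sub:")

def is_pure(token: str) -> bool:
    # Hypothetical purity test: the token is a single whole morpheme
    # (a canonical root or affix identifier), not a fragment or marker.
    return token.startswith(("ROOT_", "AFF_"))

def alignment_metrics(tokens: list[str]) -> tuple[float, float]:
    """Return (Turkish Token Percentage, Pure Token Percentage) in percent."""
    n = len(tokens)
    ttp = 100 * sum(map(is_valid_turkish, tokens)) / n
    ptp = 100 * sum(map(is_pure, tokens)) / n
    return ttp, ptp

print(alignment_metrics(tokens))  # (87.5, 75.0)
```

Under these toy definitions the ordering TTP ≥ PTP falls out naturally, since every pure token is also valid, which matches the qualitative relationship between the paper's reported 90.29% and 85.80%.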