Tokens with Meaning: A Hybrid Tokenization Approach for Turkish

arXiv cs.CL / 4/1/2026


Key Points

  • The paper argues that standard frequency-based subword tokenizers like BPE and WordPiece can poorly segment morphologically rich, agglutinative languages such as Turkish, obscuring morpheme boundaries.
  • It proposes a linguistically informed hybrid Turkish tokenizer that combines dictionary-based morphological segmentation, phonological normalization for allomorph mapping, and a controlled subword fallback for out-of-vocabulary handling.
  • The released vocabulary includes 22,231 root tokens mapped to 20,000 canonical root identifiers, 72 affix identifiers covering 177 allomorphic forms, 12,696 subword units, and an orthographic case token to preserve capitalization.
  • On TR-MMLU, the tokenizer achieves 90.29% Turkish Token Percentage and 85.80% Pure Token Percentage, outperforming several general-purpose tokenizers on linguistic alignment metrics.
  • Downstream evaluations with random-initialization controls show the tokenizer improves sentence embedding and linguistic benchmark performance, including strong results on Turkish STS, MTEB-TR, and TurBLiMP.
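The three-stage pipeline summarized above (dictionary-driven root lookup, allomorph normalization to shared affix identifiers, and a subword fallback) can be sketched as follows. This is a minimal illustrative sketch with a toy lexicon, not the released vocabulary: all identifiers (`ROOT_ev`, `AFF_PL`, etc.) and the greedy longest-match strategy are assumptions for illustration.

```python
# Toy stand-ins for the paper's resources (the real system uses 22,231 roots,
# 72 affix identifiers covering 177 allomorphs, and 12,696 subword units).
ROOTS = {"ev": "ROOT_ev", "göz": "ROOT_göz"}      # surface root -> canonical ID
AFFIXES = {"ler": "AFF_PL", "lar": "AFF_PL",      # allomorphs share one ID
           "de": "AFF_LOC", "da": "AFF_LOC"}

def tokenize_word(word: str) -> list[str]:
    """Greedy longest-root match, then affix stripping, then a fallback."""
    # 1) dictionary-driven root segmentation (longest match from the left)
    for end in range(len(word), 0, -1):
        if word[:end] in ROOTS:
            tokens = [ROOTS[word[:end]]]
            rest = word[end:]
            # 2) phonological normalization: map each allomorph to its ID
            while rest:
                for alen in range(len(rest), 0, -1):
                    if rest[:alen] in AFFIXES:
                        tokens.append(AFFIXES[rest[:alen]])
                        rest = rest[alen:]
                        break
                else:
                    # 3) controlled fallback for residue the lexicon misses
                    tokens.append(f"SUB_{rest}")
                    rest = ""
            return tokens
    return [f"SUB_{word}"]  # fully out-of-vocabulary word

print(tokenize_word("evlerde"))   # ['ROOT_ev', 'AFF_PL', 'AFF_LOC']
print(tokenize_word("gözlerde"))  # ['ROOT_göz', 'AFF_PL', 'AFF_LOC']
```

Note how `-ler`/`-lar` and `-de`/`-da` collapse to single identifiers, which is the allomorph-mapping idea: vowel-harmony variants of one morpheme no longer occupy separate vocabulary slots.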

Abstract

Tokenization shapes how language models perceive morphology and meaning in NLP, yet widely used frequency-driven subword tokenizers (e.g., Byte Pair Encoding and WordPiece) can fragment morphologically rich and agglutinative languages in ways that obscure morpheme boundaries. We introduce a linguistically informed hybrid tokenizer for Turkish that combines (i) dictionary-driven morphological segmentation (roots and affixes), (ii) phonological normalization that maps allomorphic variants to shared identifiers, and (iii) a controlled subword fallback for out-of-vocabulary coverage. Concretely, our released Turkish vocabulary contains 22,231 root tokens mapped to 20,000 canonical root identifiers (with leading spaces to mark word boundaries), 72 affix identifiers that cover 177 allomorphic surface forms, and 12,696 subword units; an orthographic case token preserves capitalization without inflating the vocabulary. We evaluate tokenization quality on the TR-MMLU dataset using two linguistic alignment metrics: Turkish Token Percentage (TR%), the proportion of produced tokens that correspond to Turkish lexical/morphemic units under our lexical resources, and Pure Token Percentage (Pure%), the proportion of tokens aligning with unambiguous root/affix boundaries. The proposed tokenizer reaches 90.29% TR% and 85.80% Pure% on TR-MMLU, substantially exceeding several general-purpose tokenizers. We further validate practical utility with downstream sentence embedding benchmarks under a strict *random initialization* control to isolate tokenizer inductive bias. Across four matched models (TurkishTokenizer, CosmosGPT2, Mursit, and Tabi), TurkishTokenizer outperforms all baselines on the Turkish STS Benchmark and achieves the strongest overall average on MTEB-TR. It also yields the strongest average accuracy on TurBLiMP under a centroid-based proxy.
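The two alignment metrics defined in the abstract are simple token-level proportions. The sketch below shows how they could be computed; the unit inventories here are toy stand-ins, not the paper's lexical resources, and the function name is illustrative.

```python
# Toy inventories (assumptions): the paper's actual lexical resources define
# which tokens count as Turkish units and which align with unambiguous
# root/affix boundaries.
TURKISH_UNITS = {"ev", "ler", "de", "kitap"}  # toy lexical/morphemic units
PURE_UNITS = {"ev", "ler", "kitap"}           # toy unambiguous roots/affixes

def alignment_metrics(tokens: list[str]) -> tuple[float, float]:
    """Return (TR%, Pure%) as percentages over the produced tokens."""
    tr = sum(t in TURKISH_UNITS for t in tokens) / len(tokens)
    pure = sum(t in PURE_UNITS for t in tokens) / len(tokens)
    return 100 * tr, 100 * pure

tr_pct, pure_pct = alignment_metrics(["ev", "ler", "de", "##xq"])
print(f"TR% = {tr_pct:.2f}, Pure% = {pure_pct:.2f}")
# TR% = 75.00, Pure% = 50.00
```

By construction Pure% can never exceed TR%, since every unambiguous root/affix token is also a Turkish lexical/morphemic unit, which matches the reported figures (85.80% ≤ 90.29%).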