意味を持つトークン：トルコ語のためのハイブリッド・トークン化アプローチ

arXiv cs.CL / 2026/4/1

📰 ニュースIdeas & Deep AnalysisModels & Research

共有:

要点

本論文は、BPEやWordPieceのような頻度ベースの標準的サブワード・トークナイザは、形態的に豊かな膠着語であるトルコ語に対して不適切に分割してしまい、形態素の境界を見えにくくすると主張する。
その代わりに、辞書ベースの形態素分割、異形態対応のための音韻的正規化、そして未知語への対応として制御されたサブワードのフォールバックを組み合わせた、言語学に基づくハイブリッドなトルコ語トークナイザを提案する。
公開される語彙は、20,000の正規化されたルート識別子に対応付けられた22,231のルートトークン、177の異形態をカバーする72の接辞識別子、12,696のサブワード単位、そして大文字小文字を保持するための正書法上のケース・トークンを含む。
TR-MMLUでは、このトークナイザはトルコ語トークン割合90.29%および純粋トークン割合85.80%を達成し、言語学的アラインメント指標に基づく複数の汎用トークナイザを上回る。
ランダム初期化による制御を伴う下流評価では、このトークナイザが文埋め込みおよび言語ベンチマークの性能を向上させ、トルコ語STS、MTEB-TR、TurBLiMPで特に強い結果を示す。

Abstract

Tokenization shapes how language models perceive morphology and meaning in NLP, yet widely used frequency-driven subword tokenizers (e.g., Byte Pair Encoding and WordPiece) can fragment morphologically rich and agglutinative languages in ways that obscure morpheme boundaries. We introduce a linguistically informed hybrid tokenizer for Turkish that combines (i) dictionary-driven morphological segmentation (roots and affixes), (ii) phonological normalization that maps allomorphic variants to shared identifiers, and (iii) a controlled subword fallback for out-of-vocabulary coverage. Concretely, our released Turkish vocabulary contains 22,231 root tokens mapped to 20,000 canonical root identifiers (with leading spaces to mark word boundaries), 72 affix identifiers that cover 177 allomorphic surface forms, and 12,696 subword units; an orthographic case token preserves capitalization without inflating the vocabulary. We evaluate tokenization quality on the TR-MMLU dataset using two linguistic alignment metrics: Turkish Token Percentage (TR~\%), the proportion of produced tokens that correspond to Turkish lexical/morphemic units under our lexical resources, and Pure Token Percentage (Pure~\%), the proportion of tokens aligning with unambiguous root/affix boundaries. The proposed tokenizer reaches 90.29\% TR~\% and 85.80\% Pure~\% on TR-MMLU, substantially exceeding several general-purpose tokenizers. We further validate practical utility with downstream sentence embedding benchmarks under a strict \emph{random initialization} control to isolate tokenizer inductive bias. Across four matched models (TurkishTokenizer, CosmosGPT2, Mursit, and Tabi), TurkishTokenizer outperforms all baselines on the Turkish STS Benchmark and achieves the strongest overall average on MTEB-TR. It also yields the strongest average accuracy on the TurBLiMP under a centroid-based proxy.