HeceTokenizer: A Syllable-Based Tokenization Approach for Turkish Retrieval

arXiv cs.CL / 4/14/2026

Key Points

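HeceTokenizer is proposed as a method that exploits the deterministic six-pattern phonological structure of Turkish to build a closed, OOV-free syllable-level tokenizer with a vocabulary of roughly 8,000 syllable types.
A 1.5M-parameter BERT-tiny is pretrained from scratch with masked language modeling (MLM) on a subset of Turkish Wikipedia and evaluated on the TQuAD retrieval benchmark.
It reaches a Recall@5 of 50.3%, exceeding the 46.92% reported for a morphology-driven baseline that is 200 times larger.
Combined with a fine-grained chunk-based retrieval strategy, these results suggest that the linguistic regularity of syllables can serve as an effective and resource-efficient inductive bias for retrieval tasks.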
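To make the syllable-level idea concrete, here is a minimal Python sketch of rule-based Turkish syllabification. It is not the paper's implementation, which this summary does not describe. It assumes the commonly cited rule that every syllable contains exactly one vowel and that, between two vowels, the last consonant opens the next syllable. It also assumes the six patterns are V, VC, CV, CVC, VCC, and CVCC. The names `syllabify` and `cv_pattern` are hypothetical.

```python
# Turkish vowels, including the circumflex forms found in some loanwords.
VOWELS = set("aeıioöuüâîû")


def syllabify(word: str) -> list[str]:
    """Split one Turkish word into syllables (illustrative rule, see lead-in)."""
    # Turkish-aware lowercasing: dotted İ -> i, dotless I -> ı.
    word = word.replace("İ", "i").replace("I", "ı").lower()
    vowel_idx = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not vowel_idx:
        # No vowel (abbreviation, digits): return the token unsplit.
        return [word] if word else []

    syllables = []
    start = 0
    for cur, nxt in zip(vowel_idx, vowel_idx[1:]):
        # With consonants between two vowels, the last one starts the next
        # syllable; adjacent vowels are split directly between them.
        boundary = nxt - 1 if nxt - cur > 1 else nxt
        syllables.append(word[start:boundary])
        start = boundary
    syllables.append(word[start:])
    return syllables


def cv_pattern(syllable: str) -> str:
    """Map a syllable to its consonant/vowel shape, e.g. 'tan' -> 'CVC'."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable)


print(syllabify("İstanbul"))                           # ['is', 'tan', 'bul']
print([cv_pattern(s) for s in syllabify("İstanbul")])  # ['VC', 'CVC', 'CVC']
```

Because every character lands in exactly one syllable, any input can be segmented. Loanwords with initial consonant clusters (for example "tren") fall outside the six shapes under this rule, so a truly closed vocabulary would need an additional convention that this sketch does not attempt.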

Abstract

HeceTokenizer is a syllable-based tokenizer for Turkish that exploits the deterministic six-pattern phonological structure of the language to construct a closed, out-of-vocabulary (OOV)-free vocabulary of approximately 8,000 unique syllable types. A BERT-tiny encoder (1.5M parameters) is trained from scratch on a subset of Turkish Wikipedia using a masked language modeling objective and evaluated on the TQuAD retrieval benchmark using Recall@5. Combined with a fine-grained chunk-based retrieval strategy, HeceTokenizer achieves 50.3% Recall@5, surpassing the 46.92% reported by a morphology-driven baseline that uses a 200 times larger model. These results suggest that the phonological regularity of Turkish syllables provides a strong and resource-light inductive bias for retrieval tasks.
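The Recall@5 figures above can be read as the fraction of questions whose gold passage appears among the top five retrieved results. The following Python sketch illustrates that metric under stated assumptions. The abstract does not specify the evaluation protocol or how retrieved chunks are mapped back to passages, so `chunks_to_passages`, `recall_at_k`, and the `chunk_to_passage` mapping are hypothetical.

```python
def chunks_to_passages(ranked_chunks, chunk_to_passage):
    """Collapse a ranked chunk list into a ranked list of distinct parent passages.

    Assumption: each chunk belongs to exactly one passage, and a passage is
    ranked by its best-scoring chunk.
    """
    seen = set()
    passages = []
    for chunk_id in ranked_chunks:
        passage_id = chunk_to_passage[chunk_id]
        if passage_id not in seen:
            seen.add(passage_id)
            passages.append(passage_id)
    return passages


def recall_at_k(ranked_ids, gold_ids, k=5):
    """Fraction of queries whose single gold id is within the top-k results."""
    if not gold_ids or len(ranked_ids) != len(gold_ids):
        raise ValueError("need one ranked list per query and at least one query")
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


# Toy example with two queries: the first gold passage is retrieved, the second is not.
chunk_to_passage = {"c1": "p1", "c2": "p1", "c3": "p2", "c4": "p3"}
ranked = [
    chunks_to_passages(["c1", "c2", "c3"], chunk_to_passage),  # ['p1', 'p2']
    chunks_to_passages(["c4", "c1"], chunk_to_passage),        # ['p3', 'p1']
]
print(recall_at_k(ranked, ["p2", "p2"], k=5))  # 0.5
```

Deduplicating chunks before truncating to the top five matters here: otherwise several chunks from one passage could occupy multiple slots and lower the measured recall.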
