Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
arXiv cs.CL / 3/30/2026
Key Points
- The paper presents foundational NLP resources for historical Turkish, including HisTR (the first named entity recognition dataset for the language) and OTA-BOUN (its first Universal Dependencies treebank).
- It introduces the Ottoman Text Corpus (OTC), a curated, clean corpus of transliterated historical Turkish spanning multiple historical periods to support broader research and evaluation.
- Transformer-based models are trained and evaluated on three tasks (named entity recognition, dependency parsing, and part-of-speech tagging), reaching 90.29% F1 for NER, 73.79% LAS for parsing, and 94.98% F1 for POS tagging.
- The study identifies remaining challenges, such as the need for domain adaptation and significant language variation across time periods, both of which may limit how well models transfer to other texts.
- All datasets and models are released via Hugging Face (hf.co/bucolin) to establish a benchmark for future advances in historical Turkish NLP.
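
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how a labeled attachment score (LAS) and an entity-level F1 are commonly computed. It is not the paper's evaluation code: the function names, the toy arcs, and the toy entity spans are all invented for illustration, and the exact F1 variant the authors report (for example, micro versus macro averaging) is an assumption here.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, str]         # (head index, dependency label) for one token
Span = Tuple[int, int, str]   # (start, end, entity type) for one entity


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two token-aligned parses."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must have the same length")
    if not gold:
        return 0.0, 0.0
    # UAS counts tokens whose head is correct; LAS also requires the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


def span_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Entity-level F1 (percentage): a span counts only on an exact match."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy data (hypothetical, not taken from OTA-BOUN or HisTR).
gold_arcs = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred_arcs = [(2, "nsubj"), (0, "root"), (4, "det"), (3, "obj")]
uas, las = attachment_scores(gold_arcs, pred_arcs)

gold_spans = {(0, 2, "PER"), (5, 6, "LOC")}
pred_spans = {(0, 2, "PER"), (5, 6, "ORG"), (8, 9, "LOC")}
ner_f1 = span_f1(gold_spans, pred_spans)
```

On this toy input, three of four heads are correct but only two of four head-label pairs match, so LAS is lower than UAS. The same strictness applies to the entity score, where a span with the right boundaries but the wrong type counts as both a false positive and a false negative.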