Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

arXiv cs.CL / 3/30/2026


Key Points

  • The paper presents foundational NLP resources for historical Turkish, including HisTR, the first named entity recognition dataset for the language, and OTA-BOUN, its first Universal Dependencies treebank.
  • It introduces the Ottoman Text Corpus (OTC), a curated, clean corpus of transliterated historical Turkish spanning multiple historical periods to support broader research and evaluation.
  • Transformer-based models are trained and evaluated on three tasks (named entity recognition, dependency parsing, and part-of-speech tagging), reaching 90.29% F1 for NER, 73.79% LAS for parsing, and 94.98% F1 for POS tagging.
  • The study identifies remaining challenges such as domain adaptation needs and significant language variation across time periods, which may affect model portability.
  • All datasets and models are released via Hugging Face (hf.co/bucolin) to establish a benchmark for future advances in historical Turkish NLP.
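
For readers unfamiliar with the NER metric quoted above, the following is a minimal sketch of one common way to compute F1 at the entity-span level over BIO tags. Whether the paper scores NER exactly this way is an assumption, and the function names and the example tag sequences are hypothetical, chosen only for illustration.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) entity spans in a BIO tag sequence.

    `end` is exclusive. An I- tag that does not continue an open span of the
    same label is ignored (a simplifying assumption of this sketch).
    """
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: a predicted span counts only if boundaries and label match."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction finds the person but misses the location.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(entity_f1(gold, pred))
```

In this example precision is 1.0 and recall is 0.5, so the score is their harmonic mean. Token-level F1, which some evaluations report instead, would give a different number on the same data.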

Abstract

This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained on these datasets for named entity recognition, dependency parsing, and part-of-speech tagging. Furthermore, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results demonstrate substantial improvements in the computational analysis of historical Turkish, with strong performance on tasks that require understanding of historical linguistic structures: 90.29% F1 in named entity recognition, 73.79% LAS for dependency parsing, and 94.98% F1 for part-of-speech tagging. They also highlight existing challenges, such as domain adaptation and language variation across time periods. All the resources and models presented are available at https://hf.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.
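
The parsing result is reported as LAS (labeled attachment score): the share of tokens whose predicted head and dependency label are both correct. The sketch below illustrates that definition alongside the unlabeled variant (UAS). It is a simplified illustration, not the paper's evaluation code; the function name and the toy parse are hypothetical, and official evaluation tools may differ in details such as how label subtypes or multiword tokens are handled.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for one or more sentences flattened into token lists.

    Each token is a (head_index, dependency_label) pair, with head 0 for the root.
    UAS counts tokens with the correct head; LAS also requires the correct label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must have the same number of tokens")
    if not gold:
        return 0.0, 0.0
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_heads_and_labels = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return correct_heads / total, correct_heads_and_labels / total


# Hypothetical four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # prints "UAS = 0.75, LAS = 0.50"
```

Because LAS requires both the head and the label to match, it can never exceed UAS on the same parse.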