Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models

arXiv cs.CL / 3/30/2026



Key Points

The paper presents foundational NLP resources for historical Turkish: HisTR, the first named entity recognition dataset for the language, and OTA-BOUN, its first Universal Dependencies treebank.
It introduces the Ottoman Text Corpus (OTC), a curated, clean corpus of transliterated historical Turkish spanning multiple historical periods to support broader research and evaluation.
Transformer-based models are trained and evaluated on three tasks (named entity recognition, dependency parsing, and part-of-speech tagging), reaching 90.29% F1 for NER, 73.79% LAS for parsing, and 94.98% F1 for POS tagging.
The study identifies remaining challenges such as domain adaptation needs and significant language variation across time periods, which may affect model portability.
All datasets and models are released via Hugging Face (hf.co/bucolin) to establish a benchmark for future advances in historical Turkish NLP.
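For readers unfamiliar with the parsing metric quoted above, the following is a minimal sketch of how a labeled attachment score (LAS) is conventionally computed: the share of tokens whose predicted head and dependency label both match the gold annotation. The function name `attachment_scores` and the toy sentence are illustrative assumptions, and the paper's own evaluation script may differ in details such as punctuation handling.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency label).
# Head index 0 denotes the artificial root, as in CoNLL-U.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned tokens.

    UAS counts tokens with the correct head; LAS additionally
    requires the correct dependency label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token sequence")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Hypothetical 4-token sentence (not taken from OTA-BOUN).
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "amod"), (2, "obl")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In this toy case three of four heads are correct but only two tokens also carry the right label, which illustrates why LAS is typically the lower and stricter of the two numbers.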

Abstract

This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Furthermore, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results demonstrate prominent improvements in the computational analysis of historical Turkish, achieving strong performance on tasks that require understanding of historical linguistic structures -- specifically, 90.29% F1 in named entity recognition, 73.79% LAS for dependency parsing, and 94.98% F1 for part-of-speech tagging. They also highlight existing challenges, such as domain adaptation and language variations between time periods. All the resources and models presented are available at https://hf.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.
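NER F1 scores such as the one reported in the abstract are commonly computed at the entity level, where a predicted entity counts only if both its boundaries and its type match the gold annotation exactly. The sketch below assumes BIO-style tags and uses illustrative labels (`PER`, `LOC`); neither the tagging scheme nor the label set is confirmed here as HisTR's, and the helper names are hypothetical.

```python
from typing import List, Set, Tuple

# An entity span: (start index, exclusive end index, entity type).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel flushes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Lenient choice: a stray I- tag opens a new span.
            start, etype = i, tag[2:]
    return spans


def span_f1(gold_tags: List[str], pred_tags: List[str]) -> float:
    """Entity-level F1: a hit requires exact boundaries and type."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical 5-token example: the second entity's boundary is wrong.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(f"F1={span_f1(gold_tags, pred_tags):.2f}")
```

Because the predicted location span runs one token too long, it earns no credit under exact matching, so only one of the two entities counts as correct.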


