TurkicNLP: An NLP Toolkit for Turkic Languages

arXiv cs.CL / 3/27/2026


Key Points

  • TurkicNLP is an open-source Python NLP toolkit aimed at unifying NLP pipelines for Turkic languages that currently lack consistent tooling and shared resources.
  • The library supports four writing systems (Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic), using automatic script detection to route input to the appropriate processing path.
  • It provides end-to-end NLP capabilities through a single language-agnostic API: tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation.
  • TurkicNLP uses a modular multi-backend design that can transparently combine rule-based finite-state transducers with neural models for different tasks.
  • Outputs are formatted in the CoNLL-U standard to improve interoperability and make it easier to extend the toolkit; the code is published on GitHub.
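
To make the script-detection point concrete, here is a minimal sketch of how input could be classified into the four script families using only Python's standard library. It is not TurkicNLP's implementation: the function name `detect_script`, the `SCRIPT_PREFIXES` table, and the majority-vote heuristic are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from Unicode character-name prefixes to the four
# script families named in the summary. Not taken from the TurkicNLP code.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Perso-Arabic",
    "OLD TURKIC": "Old Turkic Runic",
}


def detect_script(text: str) -> str:
    """Return the script family that most alphabetic characters belong to."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[script] += 1
                break
    if not counts:
        return "Unknown"
    # Majority vote: the script with the most letters wins.
    return counts.most_common(1)[0][0]
```

Under these assumptions, `detect_script("Merhaba dünya")` returns `"Latin"` and `detect_script("Сәлем әлем")` returns `"Cyrillic"`. A router could then use the returned label as a dictionary key to pick a script-specific processing path; mixed-script input would need a finer-grained (for example, per-token) decision than this majority vote.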

Abstract

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .
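
As an illustration of the CoNLL-U output format mentioned in the abstract, the following sketch serializes one annotated sentence into the standard ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The `Token` class and `to_conllu` function are hypothetical helpers, not part of TurkicNLP, and the Turkish annotation is hand-written for the example rather than produced by the toolkit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One CoNLL-U word line; "_" marks an unspecified field."""
    id: int
    form: str
    lemma: str = "_"
    upos: str = "_"
    xpos: str = "_"
    feats: str = "_"
    head: Optional[int] = None  # 0 means the syntactic root
    deprel: str = "_"
    deps: str = "_"
    misc: str = "_"


def to_conllu(text: str, tokens: List[Token]) -> str:
    """Serialize one sentence: a comment line, then ten tab-separated columns per token."""
    lines = [f"# text = {text}"]
    for t in tokens:
        head = "_" if t.head is None else str(t.head)
        lines.append("\t".join([
            str(t.id), t.form, t.lemma, t.upos, t.xpos,
            t.feats, head, t.deprel, t.deps, t.misc,
        ]))
    # CoNLL-U terminates each sentence with a blank line.
    return "\n".join(lines) + "\n\n"


# Hand-annotated illustrative example ("I read the book").
sentence = [
    Token(1, "Kitabı", "kitap", "NOUN", feats="Case=Acc|Number=Sing", head=2, deprel="obj"),
    Token(2, "okudum", "oku", "VERB", feats="Number=Sing|Person=1|Tense=Past", head=0, deprel="root"),
    Token(3, ".", ".", "PUNCT", head=2, deprel="punct"),
]
print(to_conllu("Kitabı okudum.", sentence), end="")
```

Because every backend, rule-based or neural, can write into the same ten columns, a shared format of this kind is what lets downstream tools consume the output without knowing which backend produced it.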