TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)

arXiv cs.CL / 5/7/2026

Key Points

TajikNLP is a newly announced open-source Python toolkit aimed at filling a major gap in publicly available NLP resources for Tajik written in Cyrillic.
It provides a comprehensive, modular end-to-end pipeline built around a unified Doc object, covering cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, POS tagging, stemming, lemmatization, and sentence splitting.
The toolkit introduces a unified morphology engine with controlled and deep analysis modes, designed to better handle Tajik's agglutinative nominal and verbal inflections.
TajikNLP also includes a lexicon-based sentiment analyzer and can load pre-trained Word2Vec/FastText embeddings from the Hugging Face Hub.
Reproducibility and future research are supported by openly published accompanying datasets (POS corpus, sentiment lexicon, toponym gazetteer, and personal names) and a large automated test suite (616 tests) achieving 93% code coverage.
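
To make the idea of a modular pipeline built around a single Doc object more concrete, here is a minimal sketch in plain Python. It is not TajikNLP's API: the `Doc` class, the component functions (`normalize`, `clean`, `split_sentences`, `tokenize`), and `run_pipeline` are all hypothetical names chosen for illustration, and the regex-based rules are deliberately naive stand-ins for the real components.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical container; NOT the actual TajikNLP Doc class."""
    text: str
    sentences: List[str] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)


# A component takes a Doc, annotates or rewrites it, and returns it.
Component = Callable[[Doc], Doc]


def normalize(doc: Doc) -> Doc:
    # Compose characters (NFC) so letters with diacritics are single code points.
    doc.text = unicodedata.normalize("NFC", doc.text)
    return doc


def clean(doc: Doc) -> Doc:
    # Collapse runs of whitespace and trim both ends.
    doc.text = re.sub(r"\s+", " ", doc.text).strip()
    return doc


def split_sentences(doc: Doc) -> Doc:
    # Naive rule: split on whitespace that follows '.', '!' or '?'.
    doc.sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.text) if s]
    return doc


def tokenize(doc: Doc) -> Doc:
    # Words (Unicode-aware \w covers Cyrillic) or single punctuation marks.
    doc.tokens = re.findall(r"\w+|[^\w\s]", doc.text)
    return doc


def run_pipeline(text: str, components: List[Component]) -> Doc:
    # Apply each component in order to the same Doc.
    doc = Doc(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline(
    "Ман  китоб мехонам.   Ин китоб хуб аст!",
    [normalize, clean, split_sentences, tokenize],
)
print(doc.sentences)
print(doc.tokens)
```

Because every stage shares the same signature, stages can be reordered, dropped, or swapped for stronger implementations without changing the calling code, which is the main appeal of a Doc-centered design.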

Abstract

The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces TajikNLP, an open-source Python library that provides the first comprehensive pipeline for processing authentic Tajik text while preserving the original Cyrillic orthography. The library implements a modular architecture centered around a unified Doc object, enabling sequential application of components for cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, part-of-speech tagging, stemming, lemmatization, and sentence splitting. A novel unified morphology engine is introduced, offering controlled and deep analysis modes that significantly improve handling of Tajik's agglutinative nominal and verbal inflections. The release further incorporates a lexicon-based sentiment analyser and pre-trained Word2Vec/FastText embeddings loaded directly from the Hugging Face Hub. To ensure reproducibility and facilitate future research, four accompanying linguistic datasets -- a POS-tagged corpus (52.5k entries), a sentiment lexicon (3.5k entries), a toponym gazetteer (5.6k entries), and a personal names dataset (3.8k entries) -- have been openly published under permissive licenses. The library's reliability is validated by an extensive test suite of 616 automated tests achieving 93% source code coverage. TajikNLP thus establishes a foundational technological infrastructure for Tajik language processing, lowering the barrier to entry for both academic and industrial applications in low-resource Cyrillic-script environments.
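
As a rough illustration of what a lexicon-based sentiment analyser does, the sketch below sums per-word polarities from a small dictionary. Everything here is an assumption for demonstration: the three-entry `LEXICON`, the function names `sentiment_score` and `sentiment_label`, and the integer scoring scheme are hypothetical and are not taken from TajikNLP or its published 3.5k-entry sentiment lexicon.

```python
from typing import Dict, List

# Toy polarity lexicon (illustrative entries only, not the published dataset).
LEXICON: Dict[str, int] = {
    "хуб": 1,    # "good"
    "зебо": 1,   # "beautiful"
    "бад": -1,   # "bad"
}


def sentiment_score(tokens: List[str], lexicon: Dict[str, int]) -> int:
    # Sum the polarity of every token; unknown words contribute 0.
    return sum(lexicon.get(token.lower(), 0) for token in tokens)


def sentiment_label(score: int) -> str:
    # Map the summed score onto a coarse three-way label.
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = ["Ин", "китоб", "хуб", "аст"]
score = sentiment_score(tokens, LEXICON)
print(score, sentiment_label(score))
```

A practical analyser for Tajik would likely look up lemmas or stems rather than surface forms, since inflected words would otherwise miss their lexicon entries; this sketch skips that step.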