TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)
arXiv cs.CL / 5/7/2026
Key Points
- TajikNLP is a newly announced open-source Python toolkit aimed at filling a major gap in publicly available NLP resources for Tajik written in Cyrillic.
- It provides a comprehensive, modular end-to-end pipeline built around a unified Doc object, covering cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, POS tagging, stemming, lemmatization, and sentence splitting.
- The toolkit introduces a unified morphology engine with two analysis modes, controlled and deep, designed to better handle Tajik's agglutinative nominal and verbal inflection.
- TajikNLP also includes a lexicon-based sentiment analyzer and can load pre-trained Word2Vec/FastText embeddings from the Hugging Face Hub.
- Reproducibility and future research are supported by openly published accompanying datasets (POS corpus, sentiment lexicon, toponym gazetteer, and personal names) and a large automated test suite (616 tests) achieving 93% code coverage.
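
To make the "pipeline built around a unified Doc object" idea concrete, here is a minimal, self-contained sketch in plain Python. It does not use TajikNLP and does not reflect its actual API: the `Doc` fields, the stage functions, `run_pipeline`, and the two-word `TOY_LEXICON` are all hypothetical names invented for illustration. It shows only the general pattern, in which each stage reads from and writes to one shared document object, with a naive lexicon-based sentiment score as the final stage.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical shared container; every stage reads and updates it."""
    text: str
    sentences: List[str] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)
    sentiment: float = 0.0


Stage = Callable[[Doc], Doc]


def normalize(doc: Doc) -> Doc:
    # Collapse runs of whitespace. A real normalizer would do much more.
    doc.text = re.sub(r"\s+", " ", doc.text).strip()
    return doc


def split_sentences(doc: Doc) -> Doc:
    # Naive rule: break after '.', '!' or '?' when followed by whitespace.
    doc.sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.text) if s]
    return doc


def tokenize(doc: Doc) -> Doc:
    # In Python 3, \w matches Unicode letters, including Tajik-specific
    # Cyrillic characters such as ғ, ӣ, қ, ӯ, ҳ, ҷ.
    doc.tokens = re.findall(r"\w+", doc.text.lower())
    return doc


def make_sentiment_stage(lexicon: Dict[str, float]) -> Stage:
    # Lexicon-based scoring: average the polarity of tokens found in the lexicon.
    def score(doc: Doc) -> Doc:
        hits = [lexicon[t] for t in doc.tokens if t in lexicon]
        doc.sentiment = sum(hits) / len(hits) if hits else 0.0
        return doc
    return score


def run_pipeline(text: str, stages: List[Stage]) -> Doc:
    doc = Doc(text)
    for stage in stages:
        doc = stage(doc)
    return doc


# Toy two-entry lexicon (assumption, not the published resource):
# "хуб" = good, "бад" = bad.
TOY_LEXICON = {"хуб": 1.0, "бад": -1.0}

PIPELINE = [normalize, split_sentences, tokenize, make_sentiment_stage(TOY_LEXICON)]
doc = run_pipeline("Ин   китоб хуб аст.  Ҳаво хуб аст!", PIPELINE)

print(doc.sentences)
print(doc.tokens)
print(doc.sentiment)
```

Because every stage has the same signature (`Doc` in, `Doc` out), stages can be reordered, dropped, or swapped without touching the others, which is the practical benefit of a modular design around a single document object. Note that a purely lexicon-based scorer like this one ignores negation and context.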




