A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures
arXiv cs.CL / 5/5/2026
Key Points
- The study presents the first comprehensive benchmark of machine transliteration for the Tajik (Cyrillic script) ↔ Farsi (Perso-Arabic script) language pair, evaluating multiple model families end-to-end.
- A major contribution is the construction and validation of a large, parallel Tajik–Farsi corpus compiled from heterogeneous sources, starting from 328,253 sentence pairs and sampling a 40,000-pair subset.
- Across six model classes (rule-based, LSTM+attention, character-level Transformer, G2P Transformer, multilingual pre-trained models, and byte-level ByT5), the byte-level ByT5 model delivers by far the best results (chrF++ 87.4 for Tajik→Farsi and 80.1 for reverse).
- The G2P Transformer trained from scratch also performs strongly, beating mBART (72.3 vs. 62.2 chrF++), while multilingual models that rely on subword tokenization (mT5) perform poorly (chrF++ below 18.5), indicating that transliteration quality is highly sensitive to tokenization granularity.
- The results indicate that accurate Tajik–Farsi transliteration is best achieved with byte/character-level architectures rather than traditional multilingual Seq2Seq approaches using subword tokenization.
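The tokenization-granularity point above can be made concrete with a small sketch: the same Tajik word decomposes very differently at character level versus byte level, which is the unit ByT5-style models operate on. The example word («салом», "hello") is our own illustration, not drawn from the paper.

```python
# Character- vs. byte-level views of a Tajik (Cyrillic) word.
# Byte-level models such as ByT5 consume raw UTF-8 bytes, so non-Latin
# scripts never fall back to rare or unknown subword pieces.
word = "салом"  # Tajik for "hello" (illustrative example)

chars = list(word)                        # character-level units
utf8_bytes = list(word.encode("utf-8"))   # byte-level units (ByT5-style)

print(chars)            # 5 Cyrillic characters
print(len(utf8_bytes))  # 10 bytes: each Cyrillic letter is 2 UTF-8 bytes
```

A subword tokenizer trained mostly on high-resource scripts may split such a word into many rare pieces, whereas a byte-level model sees a short, fully covered sequence, which is consistent with the large chrF++ gap the study reports between ByT5 and mT5.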