A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

arXiv cs.CL / 5/5/2026


Key Points

  • The study provides the first comprehensive benchmark of machine transliteration models for the Tajik (Cyrillic) ↔ Persian (Arabic) language pair, evaluating multiple model families end-to-end.
  • A major contribution is the construction and validation of a large parallel Tajik–Farsi corpus compiled from heterogeneous sources: 328,253 sentence pairs in total, from which a representative 40,000-pair subset was drawn by stratified random sampling.
  • Across six model classes (rule-based, LSTM+attention, character-level Transformer, G2P Transformer, multilingual pre-trained models, and byte-level ByT5), the byte-level ByT5 model delivers by far the best results (chrF++ 87.4 for Tajik→Farsi and 80.1 for the reverse direction).
  • The G2P Transformer trained from scratch also performs strongly, beating mBART (72.3 vs. 62.2 chrF++), while multilingual models that rely on subword tokenization (mT5) fail badly (chrF++ < 18.5), pointing to a strong sensitivity of transliteration quality to tokenization granularity (see the sketch after this list).
  • The results indicate that accurate Tajik–Farsi transliteration is best achieved with byte/character-level architectures rather than traditional multilingual Seq2Seq approaches using subword tokenization.
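
The tokenization-granularity effect is easy to inspect directly. The following minimal sketch (assuming the public Hugging Face checkpoints google/byt5-small and google/mt5-small, which are not necessarily the exact models the paper fine-tuned) tokenizes a Tajik word both ways:

```python
# Contrast byte-level (ByT5) and subword (mT5) tokenization on a Tajik word.
# Checkpoints are public Hugging Face releases, used purely for illustration.
from transformers import AutoTokenizer

word = "Тоҷикистон"  # "Tajikistan" in Tajik Cyrillic

# ByT5 splits text into raw UTF-8 bytes, so every Cyrillic letter maps to a
# fixed, predictable pair of byte tokens and nothing is out-of-vocabulary.
byt5 = AutoTokenizer.from_pretrained("google/byt5-small")
print(byt5.tokenize(word))  # one token per UTF-8 byte

# mT5 uses a SentencePiece subword vocabulary; infrequent Tajik strings get
# carved into opaque multi-character pieces, which hides the letter-level
# correspondences that transliteration depends on.
mt5 = AutoTokenizer.from_pretrained("google/mt5-small")
print(mt5.tokenize(word))
```

Byte-level inputs trade longer sequences for a tiny, fully covered vocabulary, which is plausibly what lets ByT5 learn letter-to-letter correspondences that the subword models never see.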

Abstract

This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of the "Shahnameh", diplomatic articles, texts of the "Masnavi-i Ma'navi", official terminology lists, and transliterated correspondences. The initial dataset comprised 328,253 sentence pairs; a representative subset of 40,000 pairs was formed using stratified random sampling. The experiments compared six classes of models: a rule-based baseline, LSTM with attention, a character-level Transformer, a G2P Transformer (trained from scratch), pre-trained multilingual models (mBART, mT5 with LoRA), and byte-level ByT5. The results demonstrate the overwhelming superiority of ByT5 (chrF++ 87.4 for Tajik to Farsi, 80.1 for the reverse direction). The G2P Transformer significantly outperformed mBART (72.3 vs. 62.2 chrF++) despite the limited data. Models using subword tokenization (mT5) failed completely (chrF++ below 18.5). The findings demonstrate that for accurate transliteration of the Tajik–Farsi pair, architectures operating at the byte or character level are unequivocally more effective than traditional multilingual Seq2Seq models relying on subword tokenization.
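
The abstract notes that mT5 was adapted with LoRA rather than fully fine-tuned. A minimal sketch of that setup with the peft library follows; the hyperparameters (rank, alpha, dropout, target modules) are illustrative assumptions, not the paper's settings:

```python
# Attach LoRA adapters to mT5 for parameter-efficient fine-tuning.
# All hyperparameters here are illustrative, not the paper's settings.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # adapter rank
    lora_alpha=32,              # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5-family attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices train
```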
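
chrF++, the paper's evaluation metric, is the character n-gram F-score chrF extended with word n-grams; in the sacrebleu library it corresponds to CHRF with word_order=2. A minimal scoring sketch, using an invented placeholder pair rather than the paper's data:

```python
# Score hypotheses against references with chrF++ via sacrebleu.
# The sentence below is an invented placeholder, not data from the corpus.
from sacrebleu.metrics import CHRF

hypotheses = ["تاجیکستان کشور زیبایی است"]    # system output (Persian script)
references = [["تاجیکستان کشور زیبایی است"]]  # one reference stream, parallel to hypotheses

chrf_pp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++
print(chrf_pp.corpus_score(hypotheses, references))  # "chrF2++ = 100.00" for an exact match
```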