Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

arXiv cs.CL / 3/25/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates how different transliteration methods (romanization, phonemic transcription, and substitution ciphers) and orthography affect multilingual NLP model performance, particularly for non-Latin scripts.
  • Controlled experiments on downstream tasks—named entity recognition (NER), part-of-speech tagging (POS), and natural language inference (NLI)—show that romanization significantly outperforms other input representations in 11 of 12 evaluation settings.
  • The authors analyze which linguistic factors matter most; the results are largely consistent with their hypothesis that romanization is the most effective transliteration approach.
  • A key driver of success is that romanization enables longer shared subword tokens with pretrained languages, improving how well multilingual models leverage existing pretraining.
  • Results suggest that transliteration design choices (not just model architecture) can substantially influence transfer and accuracy in multilingual NLP pipelines.
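The intuition behind the subword-overlap finding can be sketched in a toy example (not the paper's actual pipeline): romanizing a non-Latin word yields character sequences that match subwords seen during Latin-script pretraining, whereas a substitution cipher preserves character identity but destroys that overlap. The Cyrillic-to-Latin mapping, the cipher table, and the "pretrained vocabulary" below are all hypothetical, chosen purely for illustration.

```python
# Toy contrast of romanization vs. a substitution cipher as input
# representations for a non-Latin-script word (illustrative only).

# A tiny romanization table for a few Cyrillic letters (hypothetical).
ROMANIZE = {"м": "m", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a"}

# A substitution cipher maps each character to an arbitrary Latin letter,
# keeping a one-to-one script mapping but losing phonetic resemblance.
CIPHER = {"м": "q", "о": "x", "с": "z", "к": "j", "в": "w", "а": "y"}

def transliterate(word: str, table: dict) -> str:
    """Map each character through the table, leaving unknown characters as-is."""
    return "".join(table.get(ch, ch) for ch in word)

def shared_subwords(word: str, vocab: set) -> list:
    """Return vocabulary subwords that occur inside the word, longest first."""
    hits = {v for v in vocab if v in word}
    return sorted(hits, key=len, reverse=True)

# Hypothetical subword vocabulary learned from Latin-script pretraining data.
pretrained_vocab = {"mos", "kva", "os", "va", "m", "a"}

word = "москва"  # "Moscow" in Cyrillic
roman = transliterate(word, ROMANIZE)   # -> "moskva"
cipher = transliterate(word, CIPHER)    # -> "qxzjwy"

# Romanized input shares long subwords with the pretrained vocabulary;
# the ciphered input shares none, despite being an equivalent encoding.
print(roman, shared_subwords(roman, pretrained_vocab))
print(cipher, shared_subwords(cipher, pretrained_vocab))
```

Longer shared subwords mean the multilingual model can reuse representations learned during pretraining, which is the mechanism the paper credits for romanization's advantage.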

Abstract

Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing strong results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to the performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on three downstream tasks -- named entity recognition (NER), part-of-speech tagging (POS), and natural language inference (NLI) -- and find that romanization significantly outperforms the other input types in 11 out of 12 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributes to this success, and suggest that having longer (subword) tokens shared with pretrained languages leads to better utilization of the model.