Happiness is Sharing a Vocabulary: A Study of Transliteration Methods

arXiv cs.CL / 3/25/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates how different transliteration methods (romanization, phonemic transcription, and substitution ciphers) and orthography affect multilingual NLP model performance, particularly for non-Latin scripts.
  • Controlled experiments on downstream tasks—named entity recognition (NER), part-of-speech tagging (POS), and natural language inference (NLI)—show that romanization significantly outperforms other input representations in 11 of 12 evaluation settings.
  • The authors analyze which linguistic factors matter most; the results are largely consistent with their hypothesis that romanization is the most effective transliteration approach.
  • A key driver of success is that romanization enables longer shared subword tokens with pretrained languages, improving how well multilingual models leverage existing pretraining.
  • Results suggest that transliteration design choices (not just model architecture) can substantially influence transfer and accuracy in multilingual NLP pipelines.
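The intuition behind the subword-overlap finding can be sketched in a toy example (not the paper's actual pipeline): romanizing a non-Latin word yields character sequences that match subwords seen during Latin-script pretraining, whereas a substitution cipher preserves character identity but destroys that overlap. The Cyrillic-to-Latin mapping, the cipher table, and the "pretrained vocabulary" below are all hypothetical, chosen purely for illustration.

```python
# Toy contrast of romanization vs. a substitution cipher as input
# representations for a non-Latin-script word (illustrative only).

# A tiny romanization table for a few Cyrillic letters (hypothetical).
ROMANIZE = {"м": "m", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a"}

# A substitution cipher maps each character to an arbitrary Latin letter,
# keeping a one-to-one script mapping but losing phonetic resemblance.
CIPHER = {"м": "q", "о": "x", "с": "z", "к": "j", "в": "w", "а": "y"}

def transliterate(word: str, table: dict) -> str:
    """Map each character through the table, leaving unknown characters as-is."""
    return "".join(table.get(ch, ch) for ch in word)

def shared_subwords(word: str, vocab: set) -> list:
    """Return vocabulary subwords that occur inside the word, longest first."""
    hits = {v for v in vocab if v in word}
    return sorted(hits, key=len, reverse=True)

# Hypothetical subword vocabulary learned from Latin-script pretraining data.
pretrained_vocab = {"mos", "kva", "os", "va", "m", "a"}

word = "москва"  # "Moscow" in Cyrillic
roman = transliterate(word, ROMANIZE)   # -> "moskva"
cipher = transliterate(word, CIPHER)    # -> "qxzjwy"

# Romanized input shares long subwords with the pretrained vocabulary;
# the ciphered input shares none, despite being an equivalent encoding.
print(roman, shared_subwords(roman, pretrained_vocab))
print(cipher, shared_subwords(cipher, pretrained_vocab))
```

Longer shared subwords mean the multilingual model can reuse representations learned during pretraining, which is the mechanism the paper credits for romanization's advantage.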

Abstract

Transliteration has emerged as a promising means to bridge the gap between various languages in multilingual NLP, showing strong results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to the performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on three downstream tasks -- named entity recognition (NER), part-of-speech tagging (POS), and natural language inference (NLI) -- and find that romanization significantly outperforms the other input types in 11 out of 12 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributes to this success, and suggest that having longer (subword) tokens shared with pretrained languages leads to better utilization of the model.