Omnilingual MT: Machine Translation for 1,600 Languages
arXiv cs.CL / 3/18/2026
Key Points
- Omnilingual Machine Translation (OMT) is reported as the first MT system to support more than 1,600 languages, marking a major expansion in multilingual coverage.
- The scale is enabled by a data strategy that combines large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext.
- The paper explores two LLM specialization approaches — as a decoder-only model (OMT-LLaMA) and as a module in an encoder-decoder architecture (OMT-NLLB) — with 1B–8B parameter models matching or exceeding a 70B LLM MT baseline.
- English-to-1,600 evaluations show that while baselines can often interpret under-supported languages, they frequently fail to generate them faithfully; OMT improves both coherent generation and cross-lingual transfer.
- The leaderboard and evaluation datasets (BOUQuET and Met-BOUQuET) are evolving toward Omnilinguality and will be freely available.