Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

arXiv cs.CL / 3/27/2026


Key Points

  • The study examines low-resource machine translation that uses LLMs to generate synthetic training data from higher-resource languages, focusing on Romansh as a test case.
  • It finds that naive translation-based data augmentation can fail for Romansh because LLMs conflate its six distinct language varieties.
  • The authors propose aligning augmentation direction with the resource gradient between source and target languages rather than using a fixed source→target direction.
  • Experiments report that this resource-gradient-aligned approach improves performance, surpassing Gemini 3 Pro by 23 BLEU on the lowest-resource Romansh variety.
  • Human evaluation indicates the method produces fluent translations in the individual Romansh varieties; the authors describe it as the first model to achieve this for those varieties.

Abstract

Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.
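The abstract states the idea only at a high level, so the sketch below is one plausible reading rather than the paper's actual pipeline: when generating synthetic parallel data with an LLM, translate out of the lower-resource Romansh variety into the higher-resource language, so the variety-specific side stays natural and only the side the LLM handles reliably is synthetic. All language codes, resource counts, and the `llm_translate` stub are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical resource sizes (available sentence counts); the figures and
# language codes below are illustrative, not taken from the paper.
RESOURCE_SIZES: Dict[str, int] = {
    "de": 10_000_000,        # German: high-resource pivot (assumed)
    "rm-sursilvan": 50_000,  # one Romansh variety (assumed figure)
    "rm-puter": 5_000,       # a lower-resource Romansh variety (assumed figure)
}


def llm_translation_direction(lang_a: str, lang_b: str) -> Tuple[str, str]:
    """Pick the LLM translation direction along the resource gradient:
    natural text in the lower-resource language is translated into the
    higher-resource one, so only the high-resource side is synthetic."""
    if RESOURCE_SIZES[lang_a] <= RESOURCE_SIZES[lang_b]:
        return lang_a, lang_b  # lang_a is the lower-resource side
    return lang_b, lang_a


def make_synthetic_pairs(
    natural_low_resource_sentences: List[str],
    low_lang: str,
    high_lang: str,
    llm_translate: Callable[[str, str, str], str],  # (text, src, tgt) -> text
) -> List[Tuple[str, str]]:
    """Build (source, target) pairs for training a high_lang -> low_lang MT
    model: the target side is kept as natural text in the low-resource
    variety, and only the source side is generated by the LLM."""
    src, tgt = llm_translation_direction(low_lang, high_lang)
    return [(llm_translate(s, src, tgt), s) for s in natural_low_resource_sentences]


# Example (hypothetical): synthetic German sources for natural Puter targets.
# pairs = make_synthetic_pairs(puter_corpus, "rm-puter", "de", my_llm_translate)
```

Read this way, the approach resembles back-translation: the LLM never has to produce text in a Romansh variety that it might confuse with another, which is the failure mode the paper reports for naive augmentation.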