Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
arXiv cs.CL / 4/14/2026
Key Points
- The paper argues that multilingual LLMs underperform on cross-lingual tasks because of imbalances between high- and low-resource language data and a monolingual bias introduced during pre-training.
- It proposes a Cross-Lingual Mapping Task, added to pre-training, that performs bi-directional mapping between languages in the model's embedding space to improve alignment without hurting monolingual fluency (a hedged sketch of such an objective follows this list).
- To measure cross-lingual consistency reliably, it introduces a Language Alignment Coefficient that works even when labeled or parallel data is limited (an illustrative computation is sketched below the list).
- Experiments across machine translation, cross-lingual NLU, and cross-lingual question answering report substantial improvements versus strong multilingual baselines, including up to +11.9 BLEU for MT and +6.72 BERTScore-Precision for CLQA.
- Overall, the work suggests that adding cross-lingual objectives directly into pre-training is an effective path to boosting multilingual LLM performance across multiple cross-lingual benchmarks.
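The summary does not spell out the exact form of the Cross-Lingual Mapping Task, so the following is a minimal sketch of what a bi-directional embedding-mapping objective added to pre-training could look like. The mean pooling, cosine-distance loss, and `lam` weighting are all assumptions for illustration, not the paper's formulation.

```python
# Illustrative sketch only: pooling, loss form, and weighting are assumptions,
# not the paper's actual Cross-Lingual Mapping Task.
import torch
import torch.nn.functional as F

def pooled_embedding(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token hidden states into one sentence embedding (assumed pooling)."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

def bidirectional_mapping_loss(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Pull paired sentences from languages A and B together in both directions.

    Uses a symmetric cosine-distance term as a stand-in for bi-directional
    mapping in embedding space; the paper's formulation may differ.
    """
    a_to_b = 1.0 - F.cosine_similarity(emb_a, emb_b.detach(), dim=-1)
    b_to_a = 1.0 - F.cosine_similarity(emb_b, emb_a.detach(), dim=-1)
    return 0.5 * (a_to_b + b_to_a).mean()

def pretraining_step_loss(lm_loss_a, lm_loss_b, emb_a, emb_b, lam: float = 0.1):
    """Combine the usual language-modeling losses with the mapping term.

    `lam` is a hypothetical weighting hyperparameter; keeping it small is one
    way to avoid degrading monolingual fluency, per the paper's stated goal.
    """
    return lm_loss_a + lm_loss_b + lam * bidirectional_mapping_loss(emb_a, emb_b)
```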
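Likewise, the Language Alignment Coefficient is not defined in this summary. One plausible reading, sketched below under that assumption, is a score that compares the similarity of matched translation pairs against mismatched pairs, so it needs only a small sample of parallel sentences rather than task labels.

```python
# Illustrative sketch: this assumed definition compares paired-sentence
# similarity against a random-pair baseline; the paper's coefficient may differ.
import torch
import torch.nn.functional as F

def language_alignment_coefficient(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Score cross-lingual alignment from N paired sentence embeddings.

    emb_a[i] and emb_b[i] embed the same sentence in two languages.
    Returns matched-pair similarity minus mean mismatched-pair similarity:
    values near 1.0 suggest strong alignment, near 0.0 no better than chance.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    sims = a @ b.T                      # all pairwise cosine similarities
    paired = sims.diag().mean()         # matched translation pairs
    n = sims.size(0)
    off_diag = (sims.sum() - sims.diag().sum()) / (n * (n - 1))  # mismatched pairs
    return (paired - off_diag).item()
```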


