Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

arXiv cs.CL / 4/14/2026


Key Points

  • The paper argues that multilingual LLMs underperform on cross-lingual tasks due to high/low-resource data imbalances and monolingual bias during pre-training.
  • It proposes a Cross-Lingual Mapping Task added to pre-training that performs bi-directional mapping of languages in the model’s embedding space to improve alignment without hurting monolingual fluency.
  • To measure cross-lingual consistency reliably, it introduces a Language Alignment Coefficient that works even when labeled or parallel data is limited.
  • Experiments across machine translation, cross-lingual NLU, and cross-lingual question answering report substantial improvements versus strong multilingual baselines, including up to +11.9 BLEU for MT and +6.72 BERTScore-Precision for CLQA.
  • Overall, the work suggests that adding cross-lingual objectives directly into pre-training is an effective path to boosting multilingual LLM performance across multiple cross-lingual benchmarks.
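The summary does not spell out how the Language Alignment Coefficient is computed. One plausible reading, sketched purely for illustration (the function name, inputs, and cosine-similarity formulation below are assumptions, not the paper's definition), is a mean similarity between embeddings of parallel sentence pairs:

```python
import numpy as np

def language_alignment_coefficient(src_embs: np.ndarray,
                                   tgt_embs: np.ndarray) -> float:
    """Hypothetical alignment score: mean cosine similarity between
    embeddings of parallel sentence pairs (row i of src_embs is
    assumed parallel to row i of tgt_embs). Illustrative only --
    the paper's exact formulation is not given in this summary."""
    # L2-normalise each row so the dot product equals cosine similarity.
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    # Average the per-pair cosine similarities.
    return float(np.mean(np.sum(src * tgt, axis=1)))

# Perfectly aligned spaces (identical embeddings) score ~1.0.
e = np.random.default_rng(0).normal(size=(4, 8))
score = language_alignment_coefficient(e, e)
```

A metric of this shape needs only a small set of parallel pairs, which is consistent with the claim that the coefficient works when labeled or parallel data is limited.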

Abstract

Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.
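The abstract describes bi-directionally mapping languages within the embedding space but gives no formula. As a toy sketch only, under the assumption that the mapping is realized by learned linear maps between the two languages' embedding subspaces (the matrices `W_st`, `W_ts` and the squared-error objective below are illustrative choices, not the paper's method), the auxiliary objective might look like:

```python
import numpy as np

def bidirectional_mapping_loss(src: np.ndarray, tgt: np.ndarray,
                               W_st: np.ndarray, W_ts: np.ndarray) -> float:
    """Toy objective in the spirit of a cross-lingual mapping task:
    map source embeddings into the target space and vice versa, and
    penalise the mean squared mapping error in both directions.
    W_st and W_ts stand in for learned linear maps; this sketch
    evaluates the loss once, not the pre-training loop."""
    fwd = np.mean((src @ W_st - tgt) ** 2)  # source -> target space
    bwd = np.mean((tgt @ W_ts - src) ** 2)  # target -> source space
    return float(fwd + bwd)

# If both maps are the identity and the spaces already coincide,
# the loss is zero.
d = 8
x = np.random.default_rng(1).normal(size=(5, d))
loss = bidirectional_mapping_loss(x, x, np.eye(d), np.eye(d))
```

Adding such a term alongside the ordinary language-modeling loss would push the two languages' embeddings toward each other without replacing the monolingual objective, which matches the paper's stated goal of improving alignment without hurting monolingual fluency.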