CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training

arXiv cs.CL / 4/8/2026


Key Points

  • The paper identifies a key limitation of existing multilingual embedding models: they often fail to properly learn cross-lingual alignment, especially when linguistic resources are imbalanced and training doesn’t explicitly enforce alignment.
  • It proposes CLEAR (Cross-Lingual Enhancement in Retrieval via Reverse-training), a new loss function that uses a reverse-training scheme with English passages as a “bridge” to strengthen alignment between a target language and English.
  • Experiments show CLEAR improves cross-lingual retrieval performance by up to 15%, with the largest gains in low-resource languages while largely avoiding degradation in English.
  • The authors report CLEAR remains effective even in multilingual training settings, indicating potential scalability and broader applicability beyond single-language adaptation setups.
  • The accompanying code is released on GitHub, enabling researchers and engineers to reproduce and build on the method.

Abstract

Existing multilingual embedding models often encounter challenges in cross-lingual scenarios due to imbalanced linguistic resources and less consideration of cross-lingual alignment during training. Although standardized contrastive learning approaches for cross-lingual adaptation are widely adopted, they may struggle to capture fundamental alignment between languages and degrade performance in well-aligned languages such as English. To address these challenges, we propose Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function utilizing a reverse training scheme to improve retrieval performance across diverse cross-lingual retrieval scenarios. CLEAR leverages an English passage as a bridge to strengthen alignments between the target language and English, ensuring robust performance in the cross-lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross-lingual scenarios, with gains up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. Furthermore, our findings highlight that CLEAR offers promising effectiveness even in multilingual training, suggesting its potential for broad application and scalability. We release the code at https://github.com/dltmddbs100/CLEAR.
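The paper does not spell out the loss in this summary, so the following is only an illustrative sketch of the general idea: a bidirectional contrastive objective in which the English passage serves as a bridge, with a "reverse" term pulling target-language passages toward their English counterparts. The InfoNCE formulation, the `alpha` mixing weight, and the temperature `tau` are assumptions for illustration, not the authors' exact method.

```python
import numpy as np

def info_nce(sim, tau=0.05):
    """InfoNCE loss over a similarity matrix; row i's positive is column i."""
    logits = sim / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

def bridge_loss(q_tgt, p_en, p_tgt, tau=0.05, alpha=0.5):
    """Hypothetical CLEAR-style objective (sketch, not the paper's exact loss).

    q_tgt: target-language query embeddings    (N, d)
    p_en:  English passage embeddings (bridge) (N, d), row i pairs with query i
    p_tgt: target-language passage embeddings  (N, d)
    """
    norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    q_tgt, p_en, p_tgt = norm(q_tgt), norm(p_en), norm(p_tgt)
    # Forward direction: target-language query retrieves the English passage.
    forward = info_nce(q_tgt @ p_en.T, tau)
    # Reverse direction: the English bridge retrieves the target-language passage,
    # tying the target language's passage space to the well-aligned English space.
    reverse = info_nce(p_en @ p_tgt.T, tau)
    return alpha * forward + (1 - alpha) * reverse
```

Under this sketch, perfectly aligned query/passage triples yield a near-zero loss, while unrelated embeddings are penalized, which is the behavior a bridge-based alignment objective needs.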