On the limited utility of parallel data for learning shared multilingual representations

arXiv cs.CL / 4/1/2026


Key Points

  • The paper investigates whether parallel corpora (translated sentence pairs) meaningfully improve cross-lingual alignment when learning shared multilingual representations.
  • Experiments with varying proportions of parallel data find that its effect on alignment is minimal across multiple evaluation methods (see the sketch after this list).
  • The benefits of parallel data appear limited to early pretraining, where it may slightly accelerate representation sharing before convergence.
  • The study also reports a model-level effect: parallel data can reduce the number of language-specific neurons, even though overall cross-lingual alignment reaches similar levels without parallel input.
  • Overall, the findings suggest that cross-lingual alignment can emerge at comparable levels without explicit supervision from parallel data.
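
The summary does not spell out which evaluation methods the paper uses, but two common ways to quantify cross-lingual alignment are the mean cosine similarity between paired sentence embeddings and translation-retrieval accuracy. The sketch below illustrates both under the assumption of mean-pooled sentence embeddings; the function names and the random placeholder arrays are hypothetical, not the authors' code.

```python
import numpy as np

def cosine_alignment(src: np.ndarray, tgt: np.ndarray) -> float:
    """Mean cosine similarity between paired sentence embeddings.

    src, tgt: (n_pairs, dim) arrays where row i of each is a translation pair.
    """
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return float(np.mean(np.sum(src_n * tgt_n, axis=1)))

def retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target-language neighbour
    (by cosine similarity) is their actual translation."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src_n @ tgt_n.T  # (n_pairs, n_pairs) similarity matrix
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(src))))

# Placeholder embeddings standing in for mean-pooled hidden states of
# parallel sentences from a pretrained model (illustration only).
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 768))
tgt = src + rng.normal(scale=0.5, size=(100, 768))  # noisy "translations"

print(f"cosine alignment:   {cosine_alignment(src, tgt):.3f}")
print(f"retrieval accuracy: {retrieval_accuracy(src, tgt):.3f}")
```

On real data, src and tgt would be embeddings of translated sentence pairs extracted from models pretrained with different proportions of parallel data, and the scores would be compared across those models.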

Abstract

Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study examines the impact of parallel data, i.e., translated sentences, in pretraining as a signal for triggering representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on cross-lingual alignment. Based on multiple evaluation methods, we find that the effect is limited to potentially accelerating representation sharing in the early phases of pretraining and to decreasing the number of language-specific neurons in the model. Cross-lingual alignment seems to emerge at similar levels even without the explicit signal from parallel data.
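
The abstract mentions that parallel data decreases the number of language-specific neurons, but does not define the criterion. Below is a minimal sketch of one plausible definition, where a neuron counts as language-specific if a single language dominates its average positive activation; the function name, the 0.9 threshold, and the synthetic activations are illustrative assumptions, not the paper's method.

```python
import numpy as np

def language_specific_neurons(acts_by_lang, threshold=0.9):
    """Flag neurons whose activation mass is concentrated in one language.

    acts_by_lang: dict mapping language -> (n_tokens, n_neurons) array of
        post-nonlinearity activations collected from the same layer.
    A neuron counts as language-specific if a single language accounts for
    at least `threshold` of its total mean positive activation.
    """
    langs = list(acts_by_lang)
    # Mean positive activation per neuron, per language: (n_langs, n_neurons)
    means = np.stack(
        [np.clip(acts_by_lang[l], 0, None).mean(axis=0) for l in langs]
    )
    shares = means / (means.sum(axis=0, keepdims=True) + 1e-9)
    specific = shares.max(axis=0) >= threshold
    owners = np.array(langs)[shares.argmax(axis=0)]
    return specific, owners

# Placeholder activations standing in for captured hidden states.
rng = np.random.default_rng(1)
acts = {
    "en": rng.exponential(size=(500, 64)),
    "de": rng.exponential(size=(500, 64)),
}
acts["en"][:, :5] *= 20.0  # make the first 5 neurons strongly English-biased

mask, owners = language_specific_neurons(acts, threshold=0.9)
print(f"{mask.sum()} language-specific neurons")
print("owned by:", owners[mask])
```

Counting how many neurons pass such a threshold in models trained with and without parallel data would give the kind of model-level comparison the paper reports.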