Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

arXiv cs.CL · April 23, 2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper addresses the challenge that many languages lack enough native high-quality data to train robust quality classifiers for multilingual pretraining datasets.
  • It proposes using quality markers in embedding space that may be cross-lingually consistent, enabling high-resource languages to help filter lower-resource ones.
  • Experiments compare filtering strategies such as cross-lingual transfer, third-quartile (Q3) sampling, and retention-rate tuning, evaluated with a 1B-parameter model trained on 103B tokens.
  • Results show massive multilingual pooling often beats monolingual baselines in rank stability and overall accuracy, improving high-resource languages (e.g., French by +1.2% aggregate normalized accuracy) and matching or exceeding monolingual performance for low-resource languages.
  • The authors also find that simply increasing multilingual scale is not sufficient for stability, and that for high-resource languages the decision boundary may require refinement via Q3 sampling or retention-rate tuning.
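The summary above does not give the paper's implementation, but the core transfer idea can be sketched on synthetic data: train a quality classifier on embeddings from a high-resource language, score a low-resource language with it, and apply retention-rate tuning (a quantile cut on scores) instead of trusting the transferred decision boundary. Everything below is illustrative — the embedding dimensions, the shared "quality direction", the language offsets, and the logistic-regression classifier are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Illustrative assumption (not from the paper): a single shared "quality"
# direction in embedding space, plus a language-specific offset per language.
quality_dir = np.zeros(DIM)
quality_dir[0] = 1.0

def make_docs(n, lang_offset):
    """Synthetic embeddings; label 1 = high quality, shifted along quality_dir."""
    labels = rng.integers(0, 2, size=n)
    emb = rng.normal(size=(n, DIM)) + lang_offset
    emb = emb + labels[:, None] * 2.0 * quality_dir
    return emb, labels

X_hi, y_hi = make_docs(2000, rng.normal(size=DIM))  # "high-resource" language
X_lo, y_lo = make_docs(500, rng.normal(size=DIM))   # "low-resource" language

# Plain logistic regression on the high-resource data (full-batch gradient descent).
mean_hi = X_hi.mean(axis=0)
Xc = X_hi - mean_hi
w, b = np.zeros(DIM), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(Xc @ w + b)))
    w -= 0.5 * Xc.T @ (p - y_hi) / len(y_hi)
    b -= 0.5 * np.mean(p - y_hi)

# Cross-lingual transfer: score low-resource documents with the classifier.
# The unseen language offset shifts all scores by a constant, so the raw
# boundary (score > 0) may be miscalibrated -- the "decision boundary
# refinement" issue the summary mentions.
scores = (X_lo - mean_hi) @ w + b

# Retention-rate tuning: keep a fixed top fraction by score. Ranking within
# one language is unaffected by the constant offset, so the cut stays sound.
retain = 0.5
keep = scores >= np.quantile(scores, 1.0 - retain)
precision = y_lo[keep].mean()
print(f"retained {keep.mean():.0%}; high-quality fraction among kept: {precision:.2f}")
```

In this toy setup the quantile cut recovers a subset that is mostly high quality even when the transferred zero threshold is shifted by the language offset, which is the intuition behind tuning the retention rate rather than reusing the source-language boundary directly.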

Abstract

As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio through quality filtering. However, for many languages, native high-quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third-quartile (Q3) sampling, and retention-rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high-resource languages (a 1.2% increase in aggregate normalized accuracy for French) and matching or exceeding monolingual baselines for low-resource languages. However, we find that scale alone does not guarantee stability. Furthermore, for high-resource languages like French, we show that refining the decision boundary through Q3 sampling or retention-rate tuning is necessary to fully leverage the multilingual signal.
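The abstract names Q3 sampling and retention-rate tuning without spelling out the procedures. One plausible reading — an assumption for illustration, not the paper's definition — is a percentile cut on classifier scores: Q3 keeps documents at or above the 75th percentile, and retention-rate tuning generalizes the cut to an arbitrary kept fraction. The score array below is hypothetical.

```python
import numpy as np

def q3_filter(scores: np.ndarray) -> np.ndarray:
    """Boolean mask keeping documents at or above the third quartile of scores.

    Illustrative reading of "Q3 sampling" as a 75th-percentile cut; the paper
    may define the procedure differently.
    """
    return scores >= np.quantile(scores, 0.75)

def retention_filter(scores: np.ndarray, retain: float) -> np.ndarray:
    """Keep roughly the top `retain` fraction of documents by score."""
    return scores >= np.quantile(scores, 1.0 - retain)

# Hypothetical quality scores for eight documents.
scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6])
print(q3_filter(scores).sum())           # prints 2 (top quartile of 8 docs)
print(retention_filter(scores, 0.5).sum())  # prints 4 (top half)
```

Under this reading, Q3 sampling is just retention-rate tuning with the rate fixed at 25%; the abstract's point is that for high-resource languages like French, some such boundary refinement is needed to fully exploit the pooled multilingual signal.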