Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
arXiv cs.CL / 4/23/2026
Key Points
- The paper addresses the challenge that many languages lack enough native high-quality data to train robust quality classifiers for multilingual pretraining datasets.
- It proposes using quality markers in embedding space that may be cross-lingually consistent, enabling high-resource languages to help filter lower-resource ones.
- Experiments compare filtering strategies such as cross-lingual transfer, third quartile (Q3) sampling, and retention-rate tuning, evaluated with a 1B-parameter model trained on 103B tokens.
- Results show massive multilingual pooling often beats monolingual baselines in rank stability and overall accuracy, improving high-resource languages (e.g., French by +1.2% aggregate normalized accuracy) and matching or exceeding monolingual performance for low-resource languages.
- The authors also find that simply increasing multilingual scale is not sufficient for stability, and that for high-resource languages the decision boundary may require refinement via Q3 sampling or retention-rate tuning (see the sketch after this list).
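To make the pooling and thresholding strategies concrete, here is a minimal sketch, not the paper's implementation. It assumes precomputed document embeddings and high/low-quality labels per language; the names `embeddings_by_lang`, `labels_by_lang`, and the helper functions are hypothetical. It pools labeled data across languages into one classifier, then filters new documents either by a tuned retention rate or by a Q3 cutoff.

```python
# Hypothetical sketch of a pooled multilingual quality classifier with two
# thresholding schemes (retention-rate tuning and Q3 sampling). Names and
# data shapes are illustrative assumptions, not taken from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pooled_classifier(embeddings_by_lang, labels_by_lang):
    """Pool labeled document embeddings from many languages into one classifier."""
    X = np.concatenate(list(embeddings_by_lang.values()))
    y = np.concatenate(list(labels_by_lang.values()))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

def filter_by_retention_rate(clf, doc_embeddings, retention=0.5):
    """Keep the top `retention` fraction of documents by predicted quality score."""
    scores = clf.predict_proba(doc_embeddings)[:, 1]
    cutoff = np.quantile(scores, 1.0 - retention)
    return np.where(scores >= cutoff)[0]

def filter_by_q3(clf, doc_embeddings):
    """Keep only documents scoring above the third quartile (Q3) of scores."""
    scores = clf.predict_proba(doc_embeddings)[:, 1]
    return np.where(scores > np.quantile(scores, 0.75))[0]
```

The point of pooling before fitting is that high-resource languages supply the quality signal that low-resource languages lack, while the retention rate or Q3 cutoff is what refines the decision boundary per language.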