Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

arXiv cs.CL · April 23, 2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper addresses the challenge that many languages lack enough native high-quality data to train robust quality classifiers for multilingual pretraining datasets.
  • It proposes using quality markers in embedding space that may be cross-lingually consistent, enabling high-resource languages to help filter lower-resource ones.
  • Experiments compare filtering strategies such as cross-lingual transfer, third-quartile (Q3) sampling, and retention-rate tuning, evaluated with a 1B-parameter model trained on 103B tokens.
  • Results show massive multilingual pooling often beats monolingual baselines in rank stability and overall accuracy, improving high-resource languages (e.g., French by +1.2% aggregate normalized accuracy) and matching or exceeding monolingual performance for low-resource languages.
  • The authors also find that simply increasing multilingual scale is not sufficient for stability, and that for high-resource languages the decision boundary may require refinement via Q3 sampling or retention-rate tuning.
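The summary above does not give the paper's implementation, but the core transfer idea can be sketched on synthetic data: train a quality classifier on embeddings from a high-resource language, score a low-resource language with it, and apply retention-rate tuning (a quantile cut on scores) instead of trusting the transferred decision boundary. Everything below is illustrative — the embedding dimensions, the shared "quality direction", the language offsets, and the logistic-regression classifier are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Illustrative assumption (not from the paper): a single shared "quality"
# direction in embedding space, plus a language-specific offset per language.
quality_dir = np.zeros(DIM)
quality_dir[0] = 1.0

def make_docs(n, lang_offset):
    """Synthetic embeddings; label 1 = high quality, shifted along quality_dir."""
    labels = rng.integers(0, 2, size=n)
    emb = rng.normal(size=(n, DIM)) + lang_offset
    emb = emb + labels[:, None] * 2.0 * quality_dir
    return emb, labels

X_hi, y_hi = make_docs(2000, rng.normal(size=DIM))  # "high-resource" language
X_lo, y_lo = make_docs(500, rng.normal(size=DIM))   # "low-resource" language

# Plain logistic regression on the high-resource data (full-batch gradient descent).
mean_hi = X_hi.mean(axis=0)
Xc = X_hi - mean_hi
w, b = np.zeros(DIM), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(Xc @ w + b)))
    w -= 0.5 * Xc.T @ (p - y_hi) / len(y_hi)
    b -= 0.5 * np.mean(p - y_hi)

# Cross-lingual transfer: score low-resource documents with the classifier.
# The unseen language offset shifts all scores by a constant, so the raw
# boundary (score > 0) may be miscalibrated -- the "decision boundary
# refinement" issue the summary mentions.
scores = (X_lo - mean_hi) @ w + b

# Retention-rate tuning: keep a fixed top fraction by score. Ranking within
# one language is unaffected by the constant offset, so the cut stays sound.
retain = 0.5
keep = scores >= np.quantile(scores, 1.0 - retain)
precision = y_lo[keep].mean()
print(f"retained {keep.mean():.0%}; high-quality fraction among kept: {precision:.2f}")
```

In this toy setup the quantile cut recovers a subset that is mostly high quality even when the transferred zero threshold is shifted by the language offset, which is the intuition behind tuning the retention rate rather than reusing the source-language boundary directly.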

Abstract

As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio through quality filtering. However, for many languages, native high-quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third-quartile (Q3) sampling, and retention-rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high-resource languages (a 1.2% increase in aggregate normalized accuracy for French) and matching or exceeding monolingual baselines for low-resource languages. However, we find that scale alone does not guarantee stability. Furthermore, for high-resource languages like French, we show that refining the decision boundary through Q3 sampling or retention-rate tuning is necessary to fully leverage the multilingual signal.
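The abstract names Q3 sampling and retention-rate tuning without spelling out the procedures. One plausible reading — an assumption for illustration, not the paper's definition — is a percentile cut on classifier scores: Q3 keeps documents at or above the 75th percentile, and retention-rate tuning generalizes the cut to an arbitrary kept fraction. The score array below is hypothetical.

```python
import numpy as np

def q3_filter(scores: np.ndarray) -> np.ndarray:
    """Boolean mask keeping documents at or above the third quartile of scores.

    Illustrative reading of "Q3 sampling" as a 75th-percentile cut; the paper
    may define the procedure differently.
    """
    return scores >= np.quantile(scores, 0.75)

def retention_filter(scores: np.ndarray, retain: float) -> np.ndarray:
    """Keep roughly the top `retain` fraction of documents by score."""
    return scores >= np.quantile(scores, 1.0 - retain)

# Hypothetical quality scores for eight documents.
scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6])
print(q3_filter(scores).sum())           # prints 2 (top quartile of 8 docs)
print(retention_filter(scores, 0.5).sum())  # prints 4 (top half)
```

Under this reading, Q3 sampling is just retention-rate tuning with the rate fixed at 25%; the abstract's point is that for high-resource languages like French, some such boundary refinement is needed to fully exploit the pooled multilingual signal.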