Polyglot Teachers: Evaluating Language Models for Multilingual Synthetic Data Generation

arXiv cs.CL / 4/14/2026


Key Points

  • The paper argues that using the largest available language model as a “teacher” for multilingual synthetic SFT data is often ad hoc and can produce low-quality data that hurts smaller student models.
  • It introduces a multilingual evaluation approach (“Polyglot Score”) and reports experiments with 10 language models across 6 typologically diverse languages, generating 1.4M+ SFT examples and training 240 student models.
  • Gemma 3 27B and Aya Expanse 32B are found to be consistently effective multilingual teacher models across different student base model families.
  • The study finds teacher effectiveness is not well predicted by model scale alone; instead, intrinsic data qualities like prompt diversity, response length, and fluency explain most variance in data quality and correlate with student performance.
  • The authors provide practical recommendations for teacher-student model pairing and strategies like translating from existing prompts or responding to them to improve synthetic data for less-resourced languages.

Abstract

Synthesizing supervised finetuning (SFT) data from language models (LMs) to teach smaller models multilingual tasks has become increasingly common. However, teacher model selection is often ad hoc, typically defaulting to the largest available option, even though such models may have significant capability gaps in non-English languages. This practice can result in poor-quality synthetic data and suboptimal student downstream performance. In this work, we systematically characterize what makes an effective multilingual teacher. We combine intrinsic measures of data quality with extrinsic student model performance in a metric we call Polyglot Score, evaluating 10 LMs across 6 typologically diverse languages, generating over 1.4M SFT examples, and training 240 student models. Among the models tested, Gemma 3 27B and Aya Expanse 32B emerge as consistently effective teachers across different student base model families. Further analyses reveal that model scale alone does not significantly predict teacher effectiveness; instead, data qualities such as prompt diversity, length, and response fluency capture over 93.3% of the variance in intrinsic data quality and predict student performance. Finally, we provide practical recommendations, including matching the model families of teacher-student pairs and translating from or responding to existing prompts, which can yield improvements for less-resourced languages. We hope that our work advances data-centric research in multilingual synthetic data and LM development.
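To make the intrinsic data-quality signals concrete, the sketch below computes two common proxies over a synthetic SFT dataset: distinct-n prompt diversity and mean response length. This is a generic illustration, not the paper's actual Polyglot Score; the `distinct_n` and `intrinsic_signals` helpers and the toy examples are hypothetical.

```python
def distinct_n(texts, n=2):
    """Fraction of unique whitespace-token n-grams across texts.

    A standard lexical-diversity proxy (distinct-n); an assumption here,
    not necessarily the diversity measure used in the paper.
    """
    ngrams = []
    for t in texts:
        tokens = t.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def intrinsic_signals(examples):
    """Summarize a synthetic SFT dataset (list of prompt/response dicts)
    with simple intrinsic signals: prompt diversity and response length."""
    prompts = [ex["prompt"] for ex in examples]
    responses = [ex["response"] for ex in examples]
    return {
        "prompt_diversity": distinct_n(prompts, n=2),
        "mean_response_len": sum(len(r.split()) for r in responses) / len(responses),
    }


# Toy dataset standing in for teacher-generated multilingual SFT examples.
toy = [
    {"prompt": "Translate to Swahili: good morning", "response": "habari za asubuhi"},
    {"prompt": "Summarize the article in Hindi", "response": "lekh ka saransh"},
]
print(intrinsic_signals(toy))
```

In practice one would compute such signals per teacher model and per language, then check how well they track downstream student performance, which is the kind of intrinsic-extrinsic comparison the abstract describes.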