M-DaQ: Retrieving Samples with Multilingual Diversity and Quality for Instruction Fine-Tuning Datasets

arXiv cs.CL / 5/1/2026

Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The paper introduces M-DaQ, a multilingual diversity-and-quality sampling framework for building instruction fine-tuning (IFT) datasets, addressing the scarcity of high-quality, systematically curated multilingual IFT data.
  • M-DaQ combines a fine-tuned quality scoring model with a maximal marginal relevance–inspired selection method to jointly optimize response quality and cross-lingual semantic diversity (see the sketch after this list).
  • It also conducts the first systematic study of the Superficial Alignment Hypothesis in multilingual scenarios to understand alignment behavior across languages.
  • Experiments across 18 languages show that models trained on M-DaQ-curated data achieve average win rates above 60% against strong baselines on Alpaca-Eval and MT-Bench; human evaluations corroborate improvements in cultural relevance and instruction-following.
  • The authors release the code publicly to support reproducibility and enable follow-on research.
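To make the selection method concrete: it pairs each candidate's quality score with a penalty for semantic redundancy against already-chosen samples, in the spirit of maximal marginal relevance. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the function name `mdaq_select`, the trade-off weight `lam`, and the assumption that quality scores are scaled to [0, 1] are all hypothetical.

```python
import numpy as np

def mdaq_select(embeddings: np.ndarray, quality: np.ndarray,
                k: int, lam: float = 0.7) -> list[int]:
    """Greedy MMR-style selection over a candidate pool.

    embeddings: (n, d) L2-normalized sentence embeddings from any
                multilingual encoder (assumption for this sketch)
    quality:    (n,) scores from a quality model, assumed in [0, 1]
    k:          number of samples to keep
    lam:        weight on quality vs. diversity (hypothetical knob)
    """
    selected = [int(np.argmax(quality))]  # seed with highest-quality sample
    candidates = set(range(embeddings.shape[0])) - set(selected)

    while len(selected) < k and candidates:
        cand = np.fromiter(candidates, dtype=int)
        # Cosine similarity to the closest already-selected sample,
        # i.e. how redundant each candidate is with the current pool.
        sims = embeddings[cand] @ embeddings[selected].T
        redundancy = sims.max(axis=1)
        # MMR-style objective: reward quality, penalize redundancy.
        mmr = lam * quality[cand] - (1.0 - lam) * redundancy
        best = int(cand[np.argmax(mmr)])
        selected.append(best)
        candidates.remove(best)
    return selected
```

With L2-normalized embeddings, the dot product acts as cosine similarity; `lam` near 1 favors raw quality, while lower values push the pool toward broader cross-lingual coverage.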

Abstract

Multilingual instruction fine-tuning (IFT) empowers large language models to generalize across diverse linguistic and cultural contexts; however, high-quality, systematically curated multilingual IFT datasets remain scarce. To address this gap, we propose M-DaQ (Multilingual Diversity and Quality), a diversity-aware sampling framework that jointly optimizes instruction-response quality and cross-lingual semantic diversity. M-DaQ leverages a fine-tuned Quality Scoring Model alongside a maximal marginal relevance–inspired selection strategy to construct balanced, high-fidelity training data. Furthermore, we present the first systematic investigation of the Superficial Alignment Hypothesis in multilingual settings. Extensive evaluations across 18 languages demonstrate that models trained on M-DaQ-curated data achieve average win rates exceeding 60% against strong baselines on Alpaca-Eval and MT-Bench. Complementary human evaluations corroborate these gains, highlighting significant improvements in cultural relevance, contextual appropriateness, and instruction-following capability. The code is publicly released to facilitate reproducibility and future research.