Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

arXiv cs.CL / 4/15/2026


Key Points

  • The paper proposes a large-scale synthetic dataset for multilingual, multi-label emotion classification, addressing the lack of non-English and multi-label annotated data in existing corpora.
  • It builds 1M+ training samples across 23 languages (50k per language) using culturally adapted generation plus programmatic quality filtering, labeled with 11 emotion categories.
  • Six multilingual transformer encoders are trained under identical settings, and the best-performing model is XLM-R-Large (560M), achieving 0.868 F1-micro and 0.987 AUC-micro on the in-domain test set.
  • Zero-shot evaluations on human-annotated benchmarks (GoEmotions and SemEval-2018 Task 1 E-c) show the top model matching or outperforming English-specialist baselines on ranking metrics while covering all 23 languages.
  • The best base-sized model is released publicly on Hugging Face, enabling others to reuse and benchmark multilingual emotion classifiers.
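Since the released checkpoint is a multi-label classifier, inference means applying an independent sigmoid per emotion rather than a softmax over all eleven. The sketch below shows that post-processing step in plain numpy; the commented loading snippet assumes the checkpoint follows the standard Hugging Face `AutoModelForSequenceClassification` interface, which is not confirmed by the summary above.

```python
import numpy as np

def multilabel_predict(logits, threshold=0.5):
    """Turn raw logits into independent per-emotion probabilities.

    Each emotion gets its own sigmoid, so several emotions can be
    active for the same text (multi-label, not multi-class).
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return probs, probs >= threshold

# With the released checkpoint, the logits would come from something like
# (assumed standard transformers interface, not verified here):
#   from transformers import AutoTokenizer, AutoModelForSequenceClassification
#   name = "tabularisai/multilingual-emotion-classification"
#   tok = AutoTokenizer.from_pretrained(name)
#   model = AutoModelForSequenceClassification.from_pretrained(name)
#   logits = model(**tok("I can't believe we won!", return_tensors="pt")).logits[0]

# Hypothetical logits for three of the emotion heads:
probs, active = multilabel_predict([2.0, -1.0, 0.0])
```

A logit of 0.0 maps to probability 0.5, so the default threshold marks it active; in practice the threshold is a tunable trade-off between precision and recall per emotion.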

Abstract

Emotion classification in multilingual settings remains constrained by the scarcity of annotated data: existing corpora are predominantly English, single-label, and cover few languages. We address this gap by constructing a large-scale synthetic training corpus of over 1M multi-label samples (50k per language) across 23 languages: Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese, covering 11 emotion categories using culturally adapted generation and programmatic quality filtering. We train and compare six multilingual transformer encoders, from DistilBERT (135M parameters) to XLM-R-Large (560M parameters), under identical conditions. On our in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds English-only specialist models, tying on AP-micro (0.636) and LRAP (0.804) and surpassing them on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at https://huggingface.co/tabularisai/multilingual-emotion-classification.
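The F1-micro figures quoted above are micro-averaged: true positives, false positives, and false negatives are pooled over every (sample, emotion) pair before F1 is computed, so frequent emotions weigh more than rare ones. A minimal numpy sketch of that computation (the label matrices here are invented toy data, not the paper's):

```python
import numpy as np

def f1_micro(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN across all (sample, label) pairs."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn)

# Toy example with 2 samples and 3 emotions (rows = samples, cols = emotions):
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
score = f1_micro(y_true, y_pred)  # 2*2 / (2*2 + 1 + 1) ≈ 0.667
```

The ranking metrics in the abstract (AUC-micro, AP-micro, LRAP) differ in that they score the raw probabilities directly, without picking a decision threshold, which is why the authors call them threshold-free.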