Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations

arXiv cs.CL · April 14, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper evaluates whether web-scale unlabeled multilingual text plus LLM-generated synthetic labels can improve hateful language detection across four languages (English, German, Spanish, Vietnamese).
  • Continued pre-training of BERT on unlabeled web data, followed by supervised fine-tuning, improves macro-F1 by about 3% on average across sixteen benchmarks, with larger gains in low-resource settings.
  • It compares three LLM ensemble annotation methods (mean averaging, majority voting, and a LightGBM meta-learner), finding that the LightGBM ensemble is consistently best.
  • Training smaller models on the synthetic labels yields large improvements (e.g., Llama3.2-1B gains about +11% pooled F1), while larger models see only modest benefit (e.g., Qwen2.5-14B about +0.6%).
  • Overall, the authors conclude that combining web-scale unlabeled data with LLM-ensemble annotations is especially valuable for small models and low-resource languages.
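For context on the numbers above: macro-F1 averages the per-class F1 scores, so the (typically rarer) hateful class counts as much as the non-hateful class. A minimal plain-Python sketch of the metric; the function name and toy labels are illustrative, not from the paper:

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Average per-class F1 scores so each class counts equally."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: a classifier that misses one of two hateful posts.
print(macro_f1([1, 1, 0, 0], [1, 0, 0, 0]))  # ≈ 0.733
```

Because the average is unweighted across classes, a model that ignores the hateful class entirely is penalized heavily, which is why the metric is standard in hate speech benchmarks.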

Abstract

We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu (OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we continue pre-training BERT models with masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other two. Fine-tuning on these synthetic labels substantially benefits a small model (Llama3.2-1B: +11% pooled F1) but provides only a modest gain for the larger Qwen2.5-14B (+0.6%). Our results indicate that combining web-scale unlabelled data with LLM-ensemble annotations is most valuable for smaller models and low-resource languages.
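The first strategy, continued masked-language-model pre-training, builds on BERT's standard masking recipe: roughly 15% of positions are selected, and of those 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged, with the model trained to recover the originals at the selected positions. A simplified sketch of that corruption step, assuming the standard BERT recipe (the toy vocabulary and function name are illustrative; the paper's exact masking configuration is not stated here):

```python
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "mat"]  # illustrative stand-in for a real vocabulary

def mlm_mask(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~mask_prob of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns (corrupted inputs, labels); labels are None where no loss applies."""
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok             # model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_TOKEN  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(TOY_VOCAB)  # 10%: random token
            # else: 10% keep the original token, forcing a non-trivial copy task
    return inputs, labels
```

During continued pre-training, the cross-entropy loss is computed only at positions where the label is not None; fine-tuning on the labelled hate speech benchmarks then starts from these domain-adapted weights.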
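The second strategy aggregates per-example judgements from the four annotator LLMs into a single synthetic label. The two simpler ensembles can be sketched as below (function names and the 0.5 threshold are illustrative assumptions; the LightGBM meta-learner, which the paper finds best, would instead be trained on the annotators' outputs as input features):

```python
def mean_average(annotator_probs, threshold=0.5):
    """Average each example's hate probability across annotators, then threshold.
    annotator_probs: one list of per-example probabilities per LLM annotator."""
    n = len(annotator_probs[0])
    means = [sum(a[i] for a in annotator_probs) / len(annotator_probs) for i in range(n)]
    return [int(m >= threshold) for m in means]

def majority_vote(annotator_labels):
    """Label 1 iff a strict majority of annotators voted 1 (a 2-2 tie -> 0)."""
    k = len(annotator_labels)
    n = len(annotator_labels[0])
    return [int(sum(a[i] for a in annotator_labels) > k / 2) for i in range(n)]

# Four hypothetical annotators scoring two posts.
probs = [[0.9, 0.2], [0.8, 0.4], [0.7, 0.1], [0.6, 0.9]]
print(mean_average(probs))                              # [1, 0]
print(majority_vote([[1, 0], [1, 1], [0, 0], [1, 0]]))  # [1, 0]
```

A trained meta-learner can outperform these fixed rules because it learns per-annotator reliability (e.g., down-weighting an LLM that over-flags one language), which plausibly explains the LightGBM ensemble's consistent edge.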