Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

arXiv cs.CL / 4/16/2026


Key Points

  • The paper studies whether instruction-tuned LLMs can replace human labels in an active learning (AL) loop for anti-immigrant hostility detection, and whether AL is still needed when large portions of a corpus can be labeled cheaply by LLMs.
  • Using a new dataset of 277,902 German political TikTok comments with 25,974 LLM labels and 5,000 human annotations, the authors compare seven annotation strategies across four encoders.
  • A model trained on GPT-5.2 labels costing $43 achieves macro-F1 comparable to that of a model trained on human annotations costing $316, indicating strong cost–performance potential for LLM labeling.
  • The authors find that AL provides little advantage over random sampling in a pre-enriched label pool and that AL can yield lower F1 than full LLM annotation under similar budgets.
  • Despite similar aggregate F1 scores, the error profiles differ: LLM-trained models over-predict the positive class, with discrepancies concentrated in topically ambiguous cases. This implies that the choice of labeling strategy should consider the acceptable error structure, not only macro-F1.
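The comparison between active learning and random sampling in the key points rests on how AL selects instances. A standard pool-based approach is uncertainty sampling: label the items whose predicted probability sits closest to the decision boundary. A minimal sketch of both strategies (the pool, probabilities, and function names here are illustrative, not from the paper):

```python
import random

def uncertainty_sample(pool, predict_proba, batch_size):
    """Pick the unlabeled items whose predicted positive-class
    probability is closest to 0.5, i.e. where the model is least
    certain (classic pool-based uncertainty sampling)."""
    scored = sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))
    return scored[:batch_size]

def random_sample(pool, batch_size, seed=0):
    """Baseline: draw a batch uniformly at random from the pool."""
    rng = random.Random(seed)
    return rng.sample(pool, batch_size)

# Toy pool: items are ids; a dict stands in for a trained model's
# predicted probabilities.
pool = list(range(10))
proba = {0: 0.95, 1: 0.51, 2: 0.10, 3: 0.49, 4: 0.80,
         5: 0.30, 6: 0.55, 7: 0.02, 8: 0.62, 9: 0.47}

picked = uncertainty_sample(pool, proba.get, batch_size=3)
print(picked)  # -> [1, 3, 9], the items nearest probability 0.5
```

In an AL loop, the selected batch would be sent to annotators (human or LLM), added to the training set, and the model retrained before the next round; the paper's finding is that in a pre-enriched pool this selection step adds little over the random baseline.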

Abstract

Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels ($43) achieves comparable F1-Macro to one trained on 3,800 human annotations ($316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.
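The abstract's central caveat, that near-identical macro-F1 can mask different error structures, is easy to see with toy confusion counts (the numbers below are illustrative only, not from the paper): two binary classifiers can land within a fraction of a point of each other in macro-F1 while one produces roughly twice as many false positives.

```python
def f1(tp, fp, fn):
    """Per-class F1 from true positives, false positives, false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(tp, fp, fn, tn):
    """Binary macro-F1: mean of the positive-class and negative-class F1.
    For the negative class, the roles of FP and FN swap."""
    return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2

# Hypothetical test set: 100 positives, 900 negatives.
# "Human-trained" model: balanced errors.
a = dict(tp=80, fp=20, fn=20, tn=880)
# "LLM-trained" model: higher recall but over-predicts the positive class.
b = dict(tp=92, fp=40, fn=8, tn=860)

print(f"A: macro-F1 = {macro_f1(**a):.3f}, false positives = {a['fp']}")
print(f"B: macro-F1 = {macro_f1(**b):.3f}, false positives = {b['fp']}")
```

Both models score about 0.88–0.89 macro-F1, yet model B flags twice as many benign comments as hostile. For a moderation pipeline, which error profile is acceptable matters at least as much as the aggregate score.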