Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection
arXiv cs.CL / 4/16/2026
Key Points
- The paper studies whether instruction-tuned LLMs can replace human labels in an active learning (AL) loop for anti-immigrant hostility detection, and whether AL is still needed when large portions of a corpus can be labeled cheaply by LLMs.
- Using a new dataset of 277,902 German political TikTok comments with 25,974 LLM labels and 5,000 human annotations, the authors compare seven annotation strategies across four encoders.
- A model trained on GPT-5.2 labels ($43 in labeling cost) achieves macro-F1 comparable to a model trained on human annotations ($316), indicating a strong cost–performance case for LLM labeling.
- The authors find that AL provides little advantage over random sampling in a pre-enriched label pool and that AL can yield lower F1 than full LLM annotation under similar budgets.
- Despite similar aggregate F1 scores, the error profiles differ: LLM-trained models over-predict the positive class, with disagreements concentrated in topically ambiguous cases, implying that the choice of labeling strategy should account for which error structure is acceptable, not only macro-F1.
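The paper trains transformer encoders on TikTok comments; as a minimal, purely illustrative sketch of the pool-based active learning loop the key points refer to, here is uncertainty sampling with a scikit-learn logistic regression on synthetic imbalanced data. All names, sizes, and budgets below are assumptions for illustration, not the paper's setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in for a labeled pool and test set (80/20 class imbalance).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_pool, y_pool = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

# Seed set with guaranteed examples of both classes.
pos = np.where(y_pool == 1)[0][:10]
neg = np.where(y_pool == 0)[0][:10]
labeled = list(pos) + list(neg)
labeled_set = set(labeled)
unlabeled = [i for i in range(len(X_pool)) if i not in labeled_set]

# Five AL rounds, querying the 20 most uncertain pool items each time.
for _ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    proba = clf.predict_proba(X_pool[unlabeled])
    # Uncertainty: positive-class probability closest to 0.5.
    uncertainty = -np.abs(proba[:, 1] - 0.5)
    query = np.argsort(uncertainty)[-20:]
    for i in sorted(query, reverse=True):   # pop high indices first
        labeled.append(unlabeled.pop(i))

clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(f"labeled: {len(labeled)}, macro-F1: {macro_f1:.3f}")
```

Swapping the uncertainty-based query for `np.argsort`-free random selection gives the random-sampling baseline the paper compares against; the finding that the two perform similarly in a pre-enriched pool is about the composition of the pool, not the query rule itself.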
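The last point, that aggregate F1 can hide different error structures, is easy to see with a small hypothetical: a model that over-predicts the positive (hostile) class catches every hostile comment yet still mislabels many benign ones. The numbers below are made up for illustration, not taken from the paper:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical imbalanced test set: 90 non-hostile (0), 10 hostile (1).
y_true = np.array([0] * 90 + [1] * 10)

# Hypothetical model that over-predicts the hostile class:
# perfect recall on hostile comments, but 15 false positives.
y_pred = y_true.copy()
y_pred[:15] = 1

print(confusion_matrix(y_true, y_pred))        # rows: true class, cols: predicted
print(f1_score(y_true, y_pred, average=None))  # per-class F1: ~0.909 vs ~0.571
print(f1_score(y_true, y_pred, average="macro"))
```

The macro-F1 here is around 0.74, which looks serviceable, while the positive-class precision is only 0.4; a model with the same macro-F1 but errors concentrated in false negatives would behave very differently in deployment.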