Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

arXiv cs.CL / 4/29/2026


Key Points

  • The paper evaluates a holistic event-annotation workflow that filters irrelevant documents, merges documents about the same event, and then performs event annotation.
  • It finds that LLM-based automated annotations outperform traditional TF-IDF-style methods and event set curation approaches, but they remain less reliable than expert human annotators.
  • The study shows that using LLMs as assistive tools for expert-driven event set curation can significantly reduce experts’ time and mental effort during variable annotation.
  • When LLMs are used to extract event variables in support of expert annotators, the experts agree with the extracted variables more than with fully automated LLM annotations.
  • Overall, the results suggest LLMs are best used as annotation assistants rather than independent coders for high-stakes, gold-standard event labeling.

Abstract

Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standard, human coding is expensive and inefficient. Unlike information extraction experiments that focus on a single context, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event, and annotates the resulting events. Although LLM-based automated annotations are better than traditional TF-IDF-based methods or Event Set Curation approaches, LLMs are still not reliable annotators compared with human experts. However, using LLMs to assist experts with Event Set Curation can reduce the time and mental effort required for Variable Annotation. When LLMs are used to extract event variables in support of expert annotators, the experts agree more with these extracted variables than with fully automated LLM annotations.
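
To make the described workflow concrete, below is a minimal sketch of what an LLM-assisted event-annotation pipeline of this shape could look like. It is not the paper's implementation: the function names (`llm_is_relevant`, `llm_same_event`, `llm_extract_variables`) and the stand-in heuristics inside them are hypothetical placeholders for the LLM calls the paper describes, and the LLM outputs are treated as drafts for expert review rather than as final labels.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str


@dataclass
class EventSet:
    """A cluster of documents judged to describe the same real-world event."""
    documents: list[Document] = field(default_factory=list)
    # LLM-drafted variable values that a human expert reviews and corrects.
    suggested_variables: dict[str, str] = field(default_factory=dict)


def llm_is_relevant(doc: Document) -> bool:
    """Stage 1: relevance filtering. Placeholder for an LLM judgment."""
    return bool(doc.text.strip())  # stand-in heuristic


def llm_same_event(a: Document, b: Document) -> bool:
    """Stage 2 helper: do two documents cover the same event? Placeholder."""
    return a.text.split()[:5] == b.text.split()[:5]  # stand-in heuristic


def llm_extract_variables(event: EventSet) -> dict[str, str]:
    """Stage 3: draft event variables (e.g. date, actors) for expert review."""
    return {"summary": event.documents[0].text[:80]}  # stand-in draft


def assist_expert(docs: list[Document]) -> list[EventSet]:
    """Filter irrelevant documents, merge documents about the same event,
    then attach LLM-drafted variables that remain subject to expert review."""
    event_sets: list[EventSet] = []
    for doc in docs:
        if not llm_is_relevant(doc):
            continue
        for event in event_sets:
            if llm_same_event(event.documents[0], doc):
                event.documents.append(doc)
                break
        else:
            event_sets.append(EventSet(documents=[doc]))
    for event in event_sets:
        event.suggested_variables = llm_extract_variables(event)
    return event_sets
```

The design choice that mirrors the paper's conclusion is the last step: the pipeline ends with suggested variables handed to a human expert, not with the LLM's output committed as the gold-standard annotation.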