LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human–LLM Judgment Gaps

arXiv cs.CL / 5/1/2026

Key Points

  • The paper argues that standard LLM emotion-annotation evaluations oversimplify human disagreement by reducing it to a single “gold” label, ignoring the distributional structure of annotator uncertainty.
  • It compares human emotion-judgment distributions with those produced by four zero-shot LLMs and a fine-tuned RoBERTa model on GoEmotions and EmoBank (640,000 LLM responses in total), finding that zero-shot outputs diverge substantially from human distributions (a minimal sketch of one such comparison follows this list).
  • In-domain fine-tuning (rather than merely increasing model scale) is shown to be necessary to close the human–LLM distributional gap for emotion labeling.
  • The authors introduce a transparency score based on a lexical-grounding gradient, concluding that LLMs work best when emotions are signaled by explicit lexical markers and struggle with pragmatically complex emotions that require contextual inference (a toy version of such a score is sketched after this list).
  • Three lightweight post-hoc calibration methods can reduce the distributional gap by up to 14%, and the paper provides guidance on when LLM emotion annotations can replace human labels and when they cannot.
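
The summary does not specify which divergence metric the authors use; the following is a minimal sketch of how such a human–LLM distributional comparison could be run, assuming per-item annotator vote counts and repeated LLM samples over the same emotion categories. The counts and names (`human_counts`, `llm_counts`) are hypothetical, and Jensen–Shannon distance stands in for whatever measure the paper actually reports.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical per-item vote counts over the same four emotion categories:
# five human annotators vs. five sampled zero-shot LLM responses.
human_counts = np.array([3, 1, 1, 0])
llm_counts = np.array([5, 0, 0, 0])

def to_distribution(counts, eps=1e-9):
    """Normalize vote counts into a probability distribution (with smoothing)."""
    probs = counts.astype(float) + eps
    return probs / probs.sum()

# Jensen-Shannon distance: symmetric, bounded in [0, 1], 0 = identical.
gap = jensenshannon(to_distribution(human_counts),
                    to_distribution(llm_counts), base=2)
print(f"Human-LLM distributional gap (JS distance): {gap:.3f}")
```

Note that a zero-shot model which always picks the majority label, as in this toy example, scores a large distributional gap even though its top-1 accuracy against the majority vote is perfect; that is exactly the information a single gold label discards.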
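
The transparency score itself is not defined in this summary; one plausible toy reading of a “lexical-grounding gradient” is the fraction of texts in an emotion category that contain an explicit marker word. The mini-lexicon (`LEXICON`) and helper below are invented for illustration and are not the authors' formula.

```python
import re

# Hypothetical mini-lexicon of explicit markers per emotion category.
LEXICON = {
    "joy": {"happy", "glad", "delighted", "joy"},
    "grief": {"grieving", "mourning", "heartbroken", "grief"},
}

def transparency_score(texts, category):
    """Fraction of texts labeled with `category` that contain an explicit
    lexical marker of that emotion -- a rough proxy for lexical grounding."""
    markers = LEXICON[category]
    def has_marker(text):
        return bool(set(re.findall(r"[a-z']+", text.lower())) & markers)
    return sum(has_marker(t) for t in texts) / len(texts)

# "Best day ever." expresses joy without any lexicon word, so the score is 0.5.
print(transparency_score(["I'm so happy for you!", "Best day ever."], "joy"))
```

Under the paper's finding, categories scoring high on such a measure (explicitly marked emotions) should show strong human–LLM agreement, while low-scoring, pragmatically inferred emotions should not.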

Abstract

Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human–LLM agreement: LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions requiring contextual inference, a pattern that replicates across both categorical and continuous emotion frameworks. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.
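
The three calibration methods are not named in this summary. As one hedged illustration of what a lightweight post-hoc calibration could look like, the sketch below fits a single temperature on held-out data so that over-sharp LLM distributions are flattened toward human ones; the arrays, the `temperature_scale` helper, and the choice of temperature scaling itself are assumptions, not the paper's method.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import jensenshannon

def temperature_scale(probs, T):
    """Flatten (T > 1) or sharpen (T < 1) a distribution via its log-probabilities."""
    scaled = np.exp(np.log(probs) / T)
    return scaled / scaled.sum()

def mean_gap(T, llm_dists, human_dists):
    """Average JS distance between temperature-scaled LLM and human distributions."""
    return float(np.mean([jensenshannon(temperature_scale(p, T), q, base=2)
                          for p, q in zip(llm_dists, human_dists)]))

# Hypothetical held-out distributions (each row sums to 1); illustrative only.
llm_dists = np.array([[0.97, 0.01, 0.01, 0.01],
                      [0.90, 0.05, 0.03, 0.02]])
human_dists = np.array([[0.60, 0.20, 0.15, 0.05],
                        [0.50, 0.30, 0.10, 0.10]])

# Fit one temperature on held-out data, then reuse it on new annotations.
res = minimize_scalar(mean_gap, bounds=(0.1, 10.0), method="bounded",
                      args=(llm_dists, human_dists))
print(f"fitted T={res.x:.2f}: gap {mean_gap(1.0, llm_dists, human_dists):.3f} "
      f"-> {mean_gap(res.x, llm_dists, human_dists):.3f}")
```

Because zero-shot LLMs tend to pile probability onto one label while human annotators spread it, the fitted temperature typically lands above 1. A single scalar fit like this is cheap to estimate, which is what makes post-hoc methods of this kind attractive relative to in-domain fine-tuning.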