LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human–LLM Judgment Gaps

arXiv cs.CL / 5/1/2026

Key Points

  • The paper argues that standard LLM emotion-annotation evaluations oversimplify human disagreement by reducing it to a single “gold” label, ignoring the distributional structure of annotator uncertainty.
  • It compares human emotion-judgment distributions with those produced by four zero-shot LLMs and a fine-tuned RoBERTa model on GoEmotions and EmoBank (640,000 LLM responses in total), finding that zero-shot outputs diverge substantially from human distributions (a minimal sketch of one such comparison follows this list).
  • In-domain fine-tuning (rather than merely increasing model scale) is shown to be necessary to close the human–LLM distributional gap for emotion labeling.
  • The authors introduce a transparency score based on a lexical-grounding gradient, concluding that LLMs work best when emotions are signaled by explicit lexical markers and struggle with pragmatically complex emotions that require contextual inference (a toy version of such a score is sketched after this list).
  • Three lightweight post-hoc calibration methods can reduce the distributional gap by up to 14%, and the paper provides guidance on when LLM emotion annotations can replace human labels and when they cannot.
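
The summary does not specify which divergence metric the authors use; the following is a minimal sketch of how such a human–LLM distributional comparison could be run, assuming per-item annotator vote counts and repeated LLM samples over the same emotion categories. The counts and names (`human_counts`, `llm_counts`) are hypothetical, and Jensen–Shannon distance stands in for whatever measure the paper actually reports.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical per-item vote counts over the same four emotion categories:
# five human annotators vs. five sampled zero-shot LLM responses.
human_counts = np.array([3, 1, 1, 0])
llm_counts = np.array([5, 0, 0, 0])

def to_distribution(counts, eps=1e-9):
    """Normalize vote counts into a probability distribution (with smoothing)."""
    probs = counts.astype(float) + eps
    return probs / probs.sum()

# Jensen-Shannon distance: symmetric, bounded in [0, 1], 0 = identical.
gap = jensenshannon(to_distribution(human_counts),
                    to_distribution(llm_counts), base=2)
print(f"Human-LLM distributional gap (JS distance): {gap:.3f}")
```

Note that a zero-shot model which always picks the majority label, as in this toy example, scores a large distributional gap even though its top-1 accuracy against the majority vote is perfect; that is exactly the information a single gold label discards.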
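
The transparency score itself is not defined in this summary; one plausible toy reading of a “lexical-grounding gradient” is the fraction of texts in an emotion category that contain an explicit marker word. The mini-lexicon (`LEXICON`) and helper below are invented for illustration and are not the authors' formula.

```python
import re

# Hypothetical mini-lexicon of explicit markers per emotion category.
LEXICON = {
    "joy": {"happy", "glad", "delighted", "joy"},
    "grief": {"grieving", "mourning", "heartbroken", "grief"},
}

def transparency_score(texts, category):
    """Fraction of texts labeled with `category` that contain an explicit
    lexical marker of that emotion -- a rough proxy for lexical grounding."""
    markers = LEXICON[category]
    def has_marker(text):
        return bool(set(re.findall(r"[a-z']+", text.lower())) & markers)
    return sum(has_marker(t) for t in texts) / len(texts)

# "Best day ever." expresses joy without any lexicon word, so the score is 0.5.
print(transparency_score(["I'm so happy for you!", "Best day ever."], "joy"))
```

Under the paper's finding, categories scoring high on such a measure (explicitly marked emotions) should show strong human–LLM agreement, while low-scoring, pragmatically inferred emotions should not.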

Abstract

Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human–LLM agreement: LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions requiring contextual inference, a pattern that replicates across both categorical and continuous emotion frameworks. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.
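
The three calibration methods are not named in this summary. As one hedged illustration of what a lightweight post-hoc calibration could look like, the sketch below fits a single temperature on held-out data so that over-sharp LLM distributions are flattened toward human ones; the arrays, the `temperature_scale` helper, and the choice of temperature scaling itself are assumptions, not the paper's method.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import jensenshannon

def temperature_scale(probs, T):
    """Flatten (T > 1) or sharpen (T < 1) a distribution via its log-probabilities."""
    scaled = np.exp(np.log(probs) / T)
    return scaled / scaled.sum()

def mean_gap(T, llm_dists, human_dists):
    """Average JS distance between temperature-scaled LLM and human distributions."""
    return float(np.mean([jensenshannon(temperature_scale(p, T), q, base=2)
                          for p, q in zip(llm_dists, human_dists)]))

# Hypothetical held-out distributions (each row sums to 1); illustrative only.
llm_dists = np.array([[0.97, 0.01, 0.01, 0.01],
                      [0.90, 0.05, 0.03, 0.02]])
human_dists = np.array([[0.60, 0.20, 0.15, 0.05],
                        [0.50, 0.30, 0.10, 0.10]])

# Fit one temperature on held-out data, then reuse it on new annotations.
res = minimize_scalar(mean_gap, bounds=(0.1, 10.0), method="bounded",
                      args=(llm_dists, human_dists))
print(f"fitted T={res.x:.2f}: gap {mean_gap(1.0, llm_dists, human_dists):.3f} "
      f"-> {mean_gap(res.x, llm_dists, human_dists):.3f}")
```

Because zero-shot LLMs tend to pile probability onto one label while human annotators spread it, the fitted temperature typically lands above 1. A single scalar fit like this is cheap to estimate, which is what makes post-hoc methods of this kind attractive relative to in-domain fine-tuning.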