LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human--LLM Judgment Gaps
arXiv cs.CL / 5/1/2026
Key Points
- The paper argues that standard LLM emotion-annotation evaluations oversimplify human disagreement by reducing it to a single “gold” label, ignoring the distributional structure of annotator uncertainty.
- It compares human emotion-judgment distributions with those produced by four zero-shot LLMs and a fine-tuned RoBERTa model on GoEmotions and EmoBank (640,000 LLM responses in total), finding that zero-shot outputs diverge substantially from human distributions (a minimal divergence computation is sketched after this list).
- In-domain fine-tuning (rather than merely increasing model scale) is shown to be necessary to close the human–LLM distributional gap for emotion labeling.
- The authors introduce a transparency score based on a lexical-grounding gradient, concluding that LLMs work best when emotions are signaled by explicit lexical markers and struggle with pragmatically complex emotions that require contextual inference.
- Three lightweight post-hoc calibration methods can reduce the distributional gap by up to 14% (a calibration sketch also follows this list), and the paper provides guidance on when LLM emotion annotations can substitute for human labels and when they cannot.
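To make the distributional comparison concrete, here is a minimal sketch of measuring the human–LLM gap on a single text. The summary does not name the paper's exact metric, so Jensen-Shannon divergence is assumed; the five-emotion label set and the toy counts are likewise illustrative (GoEmotions itself uses 27 fine-grained labels).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical five-emotion label set; GoEmotions uses 27 fine-grained labels.
LABELS = ["joy", "sadness", "anger", "fear", "surprise"]

def counts_to_dist(counts):
    """Normalize raw label counts (from annotators or repeated LLM samples)
    into a probability distribution over the label set."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# One text, judged by 10 humans and by 10 sampled zero-shot LLM responses:
# the humans spread probability mass, the LLM collapses onto the modal label.
human = counts_to_dist([6, 2, 1, 1, 0])
llm = counts_to_dist([10, 0, 0, 0, 0])

# scipy's jensenshannon returns the JS *distance*; squaring it gives the
# base-2 divergence, which is 0 for identical distributions and 1 at most.
gap = jensenshannon(human, llm, base=2) ** 2
print(f"JS divergence between human and LLM label distributions: {gap:.3f}")
```

Aggregating this per-item divergence over a corpus gives one way to quantify the "distributional gap" the key points refer to.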
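The summary also does not specify which three calibration methods the paper evaluates. Temperature scaling is a standard lightweight post-hoc technique and stands in for them here as a sketch; the `fit_temperature` helper, the grid-search range, and the toy held-out data are all assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def temperature_scale(logits, T):
    """Soften (T > 1) or sharpen (T < 1) a model's label distribution
    by dividing logits by a temperature before the softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def fit_temperature(logits_list, human_dists, grid=np.linspace(0.5, 5.0, 46)):
    """Grid-search a single temperature minimizing the mean JS divergence
    between the scaled model distributions and the human distributions."""
    def mean_gap(T):
        return np.mean([jensenshannon(temperature_scale(z, T), h, base=2) ** 2
                        for z, h in zip(logits_list, human_dists)])
    return min(grid, key=mean_gap)

# Toy held-out set: overconfident model logits vs. softer human distributions.
logits_list = [np.array([4.0, 1.0, 0.5]), np.array([3.5, 2.0, 0.2])]
human_dists = [np.array([0.6, 0.3, 0.1]), np.array([0.5, 0.4, 0.1])]
T = fit_temperature(logits_list, human_dists)
print(f"fitted temperature: {T:.2f}")
```

A fitted temperature above 1 flattens the LLM's overconfident label distribution toward the softer human one, which is the kind of distributional-gap reduction the 14% figure describes.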