Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

arXiv cs.CL / 4/10/2026


Key Points

  • The paper argues that annotator disagreement on subjective content is structured and reflects demographic- and perspective-driven differences, not just random noise.
  • It finds that LLM-based approaches used as judges, including those with chain-of-thought prompting, struggle to recover the underlying structure of human disagreement.
  • The authors propose DiADEM, a neural architecture that learns demographic-axis importance (via a learned vector \(\boldsymbol{\alpha}\)) and models disagreement by combining annotator and item representations with interaction mechanisms and a disagreement-aware training loss.
  • Experiments on the DICES conversational-safety and VOICED political-offense benchmarks show DiADEM substantially outperforms prior LLM-as-a-judge and neural baselines, reaching strong disagreement tracking (e.g., \(r=0.75\) on DICES).
  • The learned importance weights indicate that race and age are consistently among the strongest demographic factors affecting disagreement across both datasets, underscoring the need to explicitly model who annotators are.
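To make the encoding idea in these points concrete, here is a minimal sketch of an importance-weighted demographic encoder with concatenation-plus-Hadamard fusion. All function names, dimensions, and the softmax normalization of \(\boldsymbol{\alpha}\) are illustrative assumptions, not the paper's actual implementation:

```python
import math
import random

random.seed(0)

def softmax(xs):
    """Normalize raw importance logits into a distribution so the
    per-axis weights are directly comparable (an assumption here;
    the paper may parameterize alpha differently)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def encode_annotator(demo_embeddings, alpha):
    """Weight each demographic-axis embedding (e.g. race, age, gender)
    by its learned importance and sum them into one annotator vector."""
    weights = softmax(alpha)
    dim = len(demo_embeddings[0])
    out = [0.0] * dim
    for w, emb in zip(weights, demo_embeddings):
        for i in range(dim):
            out[i] += w * emb[i]
    return out

def fuse(annotator_vec, item_vec):
    """Combine annotator and item representations via concatenation
    plus an elementwise (Hadamard) interaction term."""
    hadamard = [a * b for a, b in zip(annotator_vec, item_vec)]
    return annotator_vec + item_vec + hadamard

# Toy example: 3 demographic axes, 4-dimensional embeddings.
demo = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
alpha = [1.2, 0.1, -0.5]  # learned jointly with the model in practice
annot = encode_annotator(demo, alpha)
item = [random.gauss(0, 1) for _ in range(4)]
fused = fuse(annot, item)
print(len(fused))  # 12 = 4 (annotator) + 4 (item) + 4 (Hadamard)
```

Inspecting the softmaxed `alpha` after training is what would surface race and age as the dominant axes, as the last key point describes.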

Abstract

When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators' social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns "how much each demographic axis matters" for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector \(\boldsymbol{\alpha}\), fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural model baselines across standard and perspectivist metrics, achieving strong disagreement tracking (\(r = 0.75\) on DICES). The learned \(\boldsymbol{\alpha}\) weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are, not just what they label, is essential for NLP systems that aim to faithfully represent human interpretive diversity.
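The item-level disagreement loss from the abstract can be sketched as a penalty on the gap between predicted and observed per-item annotation variance. This is a minimal illustration assuming binary labels and a squared-error penalty; the paper's exact formulation may differ:

```python
def variance(xs):
    """Population variance of a list of labels."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def disagreement_loss(pred_by_item, gold_by_item):
    """Average squared gap between predicted and human annotation
    variance, computed per item. A model that predicts the right
    per-annotator spread pays no penalty; one that wrongly collapses
    an item to consensus is penalized directly.

    pred_by_item / gold_by_item: dicts mapping item id to the list of
    per-annotator labels (predicted vs. human) for that item.
    """
    losses = []
    for item_id, gold in gold_by_item.items():
        pred = pred_by_item[item_id]
        losses.append((variance(pred) - variance(gold)) ** 2)
    return sum(losses) / len(losses)

# Toy example: the model captures disagreement on item "a" but
# wrongly predicts unanimity on item "b".
gold = {"a": [0, 1, 1, 0], "b": [0, 1, 0, 1]}
pred = {"a": [0, 1, 1, 0], "b": [1, 1, 1, 1]}
print(disagreement_loss(pred, gold))  # → 0.03125
```

Only item "b" contributes here: its human labels have variance 0.25 while the predictions have variance 0, giving \((0 - 0.25)^2 = 0.0625\), averaged over the two items.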