Learning Who Disagrees: Demographic Importance Weighting for Modeling Annotator Distributions with DiADEM

arXiv cs.CL / 4/10/2026


Key Points

  • The paper argues that annotator disagreement on subjective content is structured and reflects demographic- and perspective-driven differences, not just random noise.
  • It finds that LLM-based approaches used as judges, including those with chain-of-thought prompting, struggle to recover the underlying structure of human disagreement.
  • The authors propose DiADEM, a neural architecture that learns demographic-axis importance (via a learned vector \(\boldsymbol{\alpha}\)) and models disagreement by combining annotator and item representations with interaction mechanisms and a disagreement-aware training loss.
  • Experiments on the DICES conversational-safety and VOICED political-offense benchmarks show DiADEM substantially outperforms prior LLM-as-a-judge and neural baselines, reaching strong disagreement tracking (e.g., \(r=0.75\) on DICES).
  • The learned importance weights indicate that race and age are consistently among the strongest demographic factors affecting disagreement across both datasets, underscoring the need to explicitly model who annotators are.
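To make the encoding idea in these points concrete, here is a minimal sketch of an importance-weighted demographic encoder with concatenation-plus-Hadamard fusion. All function names, dimensions, and the softmax normalization of \(\boldsymbol{\alpha}\) are illustrative assumptions, not the paper's actual implementation:

```python
import math
import random

random.seed(0)

def softmax(xs):
    """Normalize raw importance logits into a distribution so the
    per-axis weights are directly comparable (an assumption here;
    the paper may parameterize alpha differently)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def encode_annotator(demo_embeddings, alpha):
    """Weight each demographic-axis embedding (e.g. race, age, gender)
    by its learned importance and sum them into one annotator vector."""
    weights = softmax(alpha)
    dim = len(demo_embeddings[0])
    out = [0.0] * dim
    for w, emb in zip(weights, demo_embeddings):
        for i in range(dim):
            out[i] += w * emb[i]
    return out

def fuse(annotator_vec, item_vec):
    """Combine annotator and item representations via concatenation
    plus an elementwise (Hadamard) interaction term."""
    hadamard = [a * b for a, b in zip(annotator_vec, item_vec)]
    return annotator_vec + item_vec + hadamard

# Toy example: 3 demographic axes, 4-dimensional embeddings.
demo = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
alpha = [1.2, 0.1, -0.5]  # learned jointly with the model in practice
annot = encode_annotator(demo, alpha)
item = [random.gauss(0, 1) for _ in range(4)]
fused = fuse(annot, item)
print(len(fused))  # 12 = 4 (annotator) + 4 (item) + 4 (Hadamard)
```

Inspecting the softmaxed `alpha` after training is what would surface race and age as the dominant axes, as the last key point describes.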

Abstract

When humans label subjective content, they disagree, and that disagreement is not noise. It reflects genuine differences in perspective shaped by annotators' social identities and lived experiences. Yet standard practice still flattens these judgments into a single majority label, and recent LLM-based approaches fare no better: we show that prompted large language models, even with chain-of-thought reasoning, fail to recover the structure of human disagreement. We introduce DiADEM, a neural architecture that learns "how much each demographic axis matters" for predicting who will disagree and on what. DiADEM encodes annotators through per-demographic projections governed by a learned importance vector \(\boldsymbol{\alpha}\), fuses annotator and item representations via complementary concatenation and Hadamard interactions, and is trained with a novel item-level disagreement loss that directly penalizes mispredicted annotation variance. On the DICES conversational-safety and VOICED political-offense benchmarks, DiADEM substantially outperforms both the LLM-as-a-judge and neural model baselines across standard and perspectivist metrics, achieving strong disagreement tracking (\(r = 0.75\) on DICES). The learned \(\boldsymbol{\alpha}\) weights reveal that race and age consistently emerge as the most influential demographic factors driving annotator disagreement across both datasets. Our results demonstrate that explicitly modeling who annotators are, not just what they label, is essential for NLP systems that aim to faithfully represent human interpretive diversity.
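The item-level disagreement loss from the abstract can be sketched as a penalty on the gap between predicted and observed per-item annotation variance. This is a minimal illustration assuming binary labels and a squared-error penalty; the paper's exact formulation may differ:

```python
def variance(xs):
    """Population variance of a list of labels."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def disagreement_loss(pred_by_item, gold_by_item):
    """Average squared gap between predicted and human annotation
    variance, computed per item. A model that predicts the right
    per-annotator spread pays no penalty; one that wrongly collapses
    an item to consensus is penalized directly.

    pred_by_item / gold_by_item: dicts mapping item id to the list of
    per-annotator labels (predicted vs. human) for that item.
    """
    losses = []
    for item_id, gold in gold_by_item.items():
        pred = pred_by_item[item_id]
        losses.append((variance(pred) - variance(gold)) ** 2)
    return sum(losses) / len(losses)

# Toy example: the model captures disagreement on item "a" but
# wrongly predicts unanimity on item "b".
gold = {"a": [0, 1, 1, 0], "b": [0, 1, 0, 1]}
pred = {"a": [0, 1, 1, 0], "b": [1, 1, 1, 1]}
print(disagreement_loss(pred, gold))  # → 0.03125
```

Only item "b" contributes here: its human labels have variance 0.25 while the predictions have variance 0, giving \((0 - 0.25)^2 = 0.0625\), averaged over the two items.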