Structured Disagreement in Health-Literacy Annotation: Epistemic Stability, Conceptual Difficulty, and Agreement-Stratified Inference

arXiv cs.CL · April 23, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that common NLP annotation pipelines assume a single ground-truth label and often treat disagreement as noise, whereas perspectivist approaches view disagreement as potentially informative.
  • The study analyzes graded health-literacy annotations of 6,323 open-ended COVID-19 responses collected in Ecuador and Peru; multiple annotators assigned each response a proportional correctness score against normative public-health guidelines, preserving the full distribution of judgments rather than a single aggregated label.
  • Variance decomposition shows that question-level conceptual difficulty explains substantially more disagreement than annotator identity, suggesting disagreement is driven by the task rather than individual raters.
  • Agreement-stratified results indicate that effects such as country, education, and urban-rural differences can change magnitude and even reverse direction at different levels of inter-annotator agreement.
  • The authors conclude that graded interpretive tasks contain both epistemically stable and unstable components, and that strong perspectivist modeling is statistically necessary to avoid misleading conclusions from aggregated labels.
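The variance-decomposition idea behind the third point can be sketched with a toy simulation: group scores by question and by annotator, and compare the share of total variance each grouping explains (an eta-squared-style decomposition). All names, sizes, and effect magnitudes below are simulated for illustration; this is not the paper's dataset or code.

```python
import numpy as np
import pandas as pd

# Simulated proportional correctness scores (0-1) for (question, annotator)
# pairs. Question-level difficulty is given a larger effect than annotator
# leniency, mimicking the paper's reported pattern.
rng = np.random.default_rng(0)
n_questions, n_annotators = 20, 8
question_effect = rng.normal(0, 0.15, n_questions)    # conceptual difficulty
annotator_effect = rng.normal(0, 0.03, n_annotators)  # rater leniency

rows = []
for q in range(n_questions):
    for a in range(n_annotators):
        score = 0.6 + question_effect[q] + annotator_effect[a] + rng.normal(0, 0.05)
        rows.append({"question": q, "annotator": a, "score": np.clip(score, 0, 1)})
df = pd.DataFrame(rows)

def variance_share(df, group):
    """Share of total score variance explained by group means (eta-squared)."""
    grand = df["score"].mean()
    group_means = df.groupby(group)["score"].transform("mean")
    return ((group_means - grand) ** 2).sum() / ((df["score"] - grand) ** 2).sum()

q_share = variance_share(df, "question")
a_share = variance_share(df, "annotator")
print(f"question share: {q_share:.2f}, annotator share: {a_share:.2f}")
```

When the question share dominates the annotator share, as it does in this simulation by construction, disagreement is structured by the task rather than by who is rating; a fuller analysis would use a crossed random-effects model rather than this simple sum-of-squares decomposition.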

Abstract

Annotation pipelines in Natural Language Processing (NLP) commonly assume a single latent ground truth per instance and resolve disagreement through label aggregation. Perspectivist approaches challenge this view by treating disagreement as potentially informative rather than erroneous. We present a large-scale analysis of graded health-literacy annotations from 6,323 open-ended COVID-19 responses collected in Ecuador and Peru. Each response was independently labeled by multiple annotators using proportional correctness scores that reflect the degree to which a response aligns with normative public-health guidelines, allowing us to analyze the full distribution of judgments rather than aggregated labels. Variance decomposition shows that question-level conceptual difficulty accounts for substantially more variance than annotator identity, indicating that disagreement is structured by the task itself rather than driven by individual raters. Agreement-stratified analyses further reveal that key social-scientific effects, including country, education, and urban-rural differences, vary in magnitude and in some cases reverse direction across levels of inter-annotator agreement. These findings suggest that graded health-literacy evaluation contains both epistemically stable and unstable components, and that aggregating across them can obscure important inferential differences. We therefore argue that strong perspectivist modeling is not only conceptually justified but statistically necessary for valid inference in graded interpretive tasks.
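The agreement-stratified analysis can likewise be illustrated with a small simulation: score each item's inter-annotator disagreement (here, the standard deviation of its scores), bin items into agreement strata, and re-estimate a group contrast within each bin. Everything below is hypothetical, a sketch of the stratification logic rather than the authors' pipeline; the "group" column stands in for a covariate such as country.

```python
import numpy as np
import pandas as pd

# Simulated items: a binary group covariate (e.g., country A vs. B) shifts
# the true score by 0.1; per-item ambiguity controls annotator noise.
rng = np.random.default_rng(1)
n_items, n_annotators = 300, 5
group = rng.integers(0, 2, n_items)
true = 0.5 + 0.1 * group
noise_sd = rng.uniform(0.02, 0.25, n_items)  # item-level ambiguity
scores = true[:, None] + rng.normal(0, noise_sd[:, None], (n_items, n_annotators))

df = pd.DataFrame({
    "group": group,
    "mean_score": scores.mean(axis=1),
    "disagreement": scores.std(axis=1, ddof=1),
})
# qcut bins ascending disagreement, so the first bin is HIGH agreement.
df["agreement"] = pd.qcut(df["disagreement"], 3, labels=["high", "mid", "low"])

# Group contrast (difference of mean scores) within each agreement stratum.
effects = {}
for stratum, g in df.groupby("agreement", observed=True):
    effects[stratum] = (g.loc[g["group"] == 1, "mean_score"].mean()
                        - g.loc[g["group"] == 0, "mean_score"].mean())
print({str(k): round(v, 3) for k, v in effects.items()})
```

In this toy setup the true effect is constant across strata, so the per-stratum estimates merely jitter around 0.1; the paper's point is that on real annotation data the stratified estimates diverged, and in some cases reversed sign, which a single pooled estimate would hide.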