Structured Disagreement in Health-Literacy Annotation: Epistemic Stability, Conceptual Difficulty, and Agreement-Stratified Inference

arXiv cs.CL · April 23, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that common NLP annotation pipelines assume a single ground-truth label and often treat disagreement as noise, whereas perspectivist approaches view disagreement as potentially informative.
  • The study analyzes graded health-literacy annotations of 6,323 open-ended COVID-19 responses collected in Ecuador and Peru; multiple annotators assigned each response a proportional correctness score against normative public-health guidelines, preserving the full distribution of judgments rather than a single aggregated label.
  • Variance decomposition shows that question-level conceptual difficulty explains substantially more disagreement than annotator identity, suggesting disagreement is driven by the task rather than individual raters.
  • Agreement-stratified results indicate that effects such as country, education, and urban-rural differences can change magnitude and even reverse direction at different levels of inter-annotator agreement.
  • The authors conclude that graded interpretive tasks contain both epistemically stable and unstable components, and that strong perspectivist modeling is statistically necessary to avoid misleading conclusions from aggregated labels.
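The variance-decomposition idea behind the third point can be sketched with a toy simulation: group scores by question and by annotator, and compare the share of total variance each grouping explains (an eta-squared-style decomposition). All names, sizes, and effect magnitudes below are simulated for illustration; this is not the paper's dataset or code.

```python
import numpy as np
import pandas as pd

# Simulated proportional correctness scores (0-1) for (question, annotator)
# pairs. Question-level difficulty is given a larger effect than annotator
# leniency, mimicking the paper's reported pattern.
rng = np.random.default_rng(0)
n_questions, n_annotators = 20, 8
question_effect = rng.normal(0, 0.15, n_questions)    # conceptual difficulty
annotator_effect = rng.normal(0, 0.03, n_annotators)  # rater leniency

rows = []
for q in range(n_questions):
    for a in range(n_annotators):
        score = 0.6 + question_effect[q] + annotator_effect[a] + rng.normal(0, 0.05)
        rows.append({"question": q, "annotator": a, "score": np.clip(score, 0, 1)})
df = pd.DataFrame(rows)

def variance_share(df, group):
    """Share of total score variance explained by group means (eta-squared)."""
    grand = df["score"].mean()
    group_means = df.groupby(group)["score"].transform("mean")
    return ((group_means - grand) ** 2).sum() / ((df["score"] - grand) ** 2).sum()

q_share = variance_share(df, "question")
a_share = variance_share(df, "annotator")
print(f"question share: {q_share:.2f}, annotator share: {a_share:.2f}")
```

When the question share dominates the annotator share, as it does in this simulation by construction, disagreement is structured by the task rather than by who is rating; a fuller analysis would use a crossed random-effects model rather than this simple sum-of-squares decomposition.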

Abstract

Annotation pipelines in Natural Language Processing (NLP) commonly assume a single latent ground truth per instance and resolve disagreement through label aggregation. Perspectivist approaches challenge this view by treating disagreement as potentially informative rather than erroneous. We present a large-scale analysis of graded health-literacy annotations from 6,323 open-ended COVID-19 responses collected in Ecuador and Peru. Each response was independently labeled by multiple annotators using proportional correctness scores that reflect the degree to which a response aligns with normative public-health guidelines, allowing us to analyze the full distribution of judgments rather than aggregated labels. Variance decomposition shows that question-level conceptual difficulty accounts for substantially more variance than annotator identity, indicating that disagreement is structured by the task itself rather than driven by individual raters. Agreement-stratified analyses further reveal that key social-scientific effects, including country, education, and urban-rural differences, vary in magnitude and in some cases reverse direction across levels of inter-annotator agreement. These findings suggest that graded health-literacy evaluation contains both epistemically stable and unstable components, and that aggregating across them can obscure important inferential differences. We therefore argue that strong perspectivist modeling is not only conceptually justified but statistically necessary for valid inference in graded interpretive tasks.
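The agreement-stratified analysis can likewise be illustrated with a small simulation: score each item's inter-annotator disagreement (here, the standard deviation of its scores), bin items into agreement strata, and re-estimate a group contrast within each bin. Everything below is hypothetical, a sketch of the stratification logic rather than the authors' pipeline; the "group" column stands in for a covariate such as country.

```python
import numpy as np
import pandas as pd

# Simulated items: a binary group covariate (e.g., country A vs. B) shifts
# the true score by 0.1; per-item ambiguity controls annotator noise.
rng = np.random.default_rng(1)
n_items, n_annotators = 300, 5
group = rng.integers(0, 2, n_items)
true = 0.5 + 0.1 * group
noise_sd = rng.uniform(0.02, 0.25, n_items)  # item-level ambiguity
scores = true[:, None] + rng.normal(0, noise_sd[:, None], (n_items, n_annotators))

df = pd.DataFrame({
    "group": group,
    "mean_score": scores.mean(axis=1),
    "disagreement": scores.std(axis=1, ddof=1),
})
# qcut bins ascending disagreement, so the first bin is HIGH agreement.
df["agreement"] = pd.qcut(df["disagreement"], 3, labels=["high", "mid", "low"])

# Group contrast (difference of mean scores) within each agreement stratum.
effects = {}
for stratum, g in df.groupby("agreement", observed=True):
    effects[stratum] = (g.loc[g["group"] == 1, "mean_score"].mean()
                        - g.loc[g["group"] == 0, "mean_score"].mean())
print({str(k): round(v, 3) for k, v in effects.items()})
```

In this toy setup the true effect is constant across strata, so the per-stratum estimates merely jitter around 0.1; the paper's point is that on real annotation data the stratified estimates diverged, and in some cases reversed sign, which a single pooled estimate would hide.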