Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

arXiv cs.AI / 4/10/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that evaluating the disinformation risk of LLM-generated text requires measuring how human readers actually respond, rather than relying on LLM judges as a low-cost stand-in.
  • Using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier judge models, the authors audit judge-to-human alignment across overall scores, item-level ranking, and reliance on textual signals.
  • Results show persistent gaps: LLM judges score more harshly than humans, weakly recover human item-level rankings, and use different cues than human readers.
  • The judge models penalize emotional intensity more strongly and place more weight on logical rigor, indicating they are not merely mirroring human evaluation criteria.
  • Although the judges agree strongly with each other, they align poorly with human readers, suggesting that internal agreement among judges is not a reliable indicator of their validity as proxies for reader response.

Abstract

Large language models (LLMs) can generate persuasive narratives at scale, raising concerns about their potential use in disinformation campaigns. Assessing this risk ultimately requires understanding how readers receive such content. In practice, however, LLM judges are increasingly used as a low-cost substitute for direct human evaluation, even though it remains unclear whether they faithfully track reader responses. We recast evaluation in this setting as a proxy-validity problem and audit LLM judges against human reader responses. Using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier judges, we examine judge-human alignment in terms of overall scoring, item-level ordering, and signal dependence. We find persistent judge-human gaps throughout. Relative to humans, judges are typically harsher, recover item-level human rankings only weakly, and rely on different textual signals, placing more weight on logical rigor while penalizing emotional intensity more strongly. At the same time, judges agree far more with one another than with human readers. These results suggest that LLM judges form a coherent evaluative group that is much more aligned internally than it is with human readers, indicating that internal agreement is not evidence of validity as a proxy for reader response.
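
To make the three alignment checks concrete, here is a minimal sketch of how they could be computed from per-item numeric ratings. Everything below is an illustrative assumption rather than the paper's actual procedure: the synthetic data, the mean score gap as a measure of overall harshness, Spearman correlation for item-level ordering, and pairwise inter-judge correlation to contrast internal agreement with judge-human agreement.

```python
# Hypothetical sketch of a judge-human alignment audit.
# Data and metric choices are illustrative, not from the paper.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items, n_judges = 290, 8  # aligned articles, frontier judge models

# Stand-in data: per-item mean human risk ratings and per-judge scores.
human_scores = rng.uniform(1, 5, size=n_items)
judge_scores = human_scores[None, :] + rng.normal(0.6, 0.8, size=(n_judges, n_items))

# (1) Overall scoring gap: are judges systematically harsher than humans?
mean_gap = judge_scores.mean(axis=1) - human_scores.mean()
print("per-judge mean gap vs. humans:", np.round(mean_gap, 2))

# (2) Item-level ordering: how well does each judge recover the human ranking?
judge_human_rho = [spearmanr(j, human_scores)[0] for j in judge_scores]
print("judge-human Spearman rho:", np.round(judge_human_rho, 2))

# (3) Inter-judge agreement: judges can agree with each other yet still
# misalign with readers, so high internal agreement is not validity.
pairs = [(a, b) for a in range(n_judges) for b in range(a + 1, n_judges)]
inter_judge_rho = [spearmanr(judge_scores[a], judge_scores[b])[0] for a, b in pairs]
print("mean inter-judge rho:", round(float(np.mean(inter_judge_rho)), 2))
print("mean judge-human rho:", round(float(np.mean(judge_human_rho)), 2))
```

The contrast between the last two printed numbers captures the paper's central point: judges can correlate highly with one another while only weakly recovering the human ranking.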
