Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

arXiv cs.AI / 4/10/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that evaluating the disinformation risk of LLM-generated text requires measuring how human readers actually respond, rather than relying on LLM judges as a low-cost stand-in.
  • Using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier judge models, the authors audit judge-to-human alignment across overall scores, item-level ranking, and reliance on textual signals.
  • Results show persistent gaps: LLM judges score more harshly than humans, weakly recover human item-level rankings, and use different cues than human readers.
  • The judge models penalize emotional intensity more strongly and place more weight on logical rigor, indicating they are not merely mirroring human evaluation criteria.
  • Although the judges agree strongly with each other, they align poorly with human readers, suggesting that internal agreement among judges is not a reliable indicator of their validity as proxies for reader response.

Abstract

Large language models (LLMs) can generate persuasive narratives at scale, raising concerns about their potential use in disinformation campaigns. Assessing this risk ultimately requires understanding how readers receive such content. In practice, however, LLM judges are increasingly used as a low-cost substitute for direct human evaluation, even though it remains unclear whether they faithfully track reader responses. We recast evaluation in this setting as a proxy-validity problem and audit LLM judges against human reader responses. Using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier judges, we examine judge-human alignment in terms of overall scoring, item-level ordering, and signal dependence. We find persistent judge-human gaps throughout. Relative to humans, judges are typically harsher, recover item-level human rankings only weakly, and rely on different textual signals, placing more weight on logical rigor while penalizing emotional intensity more strongly. At the same time, judges agree far more with one another than with human readers. These results suggest that LLM judges form a coherent evaluative group that is much more aligned internally than it is with human readers, indicating that internal agreement is not evidence of validity as a proxy for reader response.
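
To make the three alignment checks concrete, here is a minimal sketch of how they could be computed from per-item numeric ratings. Everything below is an illustrative assumption rather than the paper's actual procedure: the synthetic data, the mean score gap as a measure of overall harshness, Spearman correlation for item-level ordering, and pairwise inter-judge correlation to contrast internal agreement with judge-human agreement.

```python
# Hypothetical sketch of a judge-human alignment audit.
# Data and metric choices are illustrative, not from the paper.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items, n_judges = 290, 8  # aligned articles, frontier judge models

# Stand-in data: per-item mean human risk ratings and per-judge scores.
human_scores = rng.uniform(1, 5, size=n_items)
judge_scores = human_scores[None, :] + rng.normal(0.6, 0.8, size=(n_judges, n_items))

# (1) Overall scoring gap: are judges systematically harsher than humans?
mean_gap = judge_scores.mean(axis=1) - human_scores.mean()
print("per-judge mean gap vs. humans:", np.round(mean_gap, 2))

# (2) Item-level ordering: how well does each judge recover the human ranking?
judge_human_rho = [spearmanr(j, human_scores)[0] for j in judge_scores]
print("judge-human Spearman rho:", np.round(judge_human_rho, 2))

# (3) Inter-judge agreement: judges can agree with each other yet still
# misalign with readers, so high internal agreement is not validity.
pairs = [(a, b) for a in range(n_judges) for b in range(a + 1, n_judges)]
inter_judge_rho = [spearmanr(judge_scores[a], judge_scores[b])[0] for a, b in pairs]
print("mean inter-judge rho:", round(float(np.mean(inter_judge_rho)), 2))
print("mean judge-human rho:", round(float(np.mean(judge_human_rho)), 2))
```

The contrast between the last two printed numbers captures the paper's central point: judges can correlate highly with one another while only weakly recovering the human ranking.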
