Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

arXiv cs.AI / April 25, 2026


Key Points

  • The paper argues that evaluating rule-governed AI by agreement with human labels can be misleading: multiple outputs may be logically valid under the same policy, so agreement metrics penalize valid decisions—a failure mode the authors call the “Agreement Trap.”
  • It proposes policy-grounded correctness with new metrics—the Defensibility Index (DI) and Ambiguity Index (AI)—to measure whether a decision is logically derivable from the governing rule hierarchy.
  • To estimate reasoning stability without extra audit runs, the authors introduce the Probabilistic Defensibility Signal (PDS), computed from audit-model token logprobs, and they use LLM reasoning traces as governance signals rather than final classification outputs.
  • Experiments on 193,000+ Reddit moderation decisions show a 33–46.6 percentage-point gap between agreement-based and policy-grounded metrics, with roughly 80% of the model's false negatives corresponding to policy-grounded decisions rather than true errors.
  • A “Governance Gate” built on these signals reportedly reaches 78.6% automation coverage while reducing risk by 64.9%; measured ambiguity is driven mainly by rule specificity, and PDS variance is attributed to governance ambiguity rather than decoding noise.
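The PDS described above is derived from the audit model's token logprobs rather than from extra audit runs. The paper's exact aggregation formula is not given in this summary, but a minimal illustrative sketch—assuming PDS is a length-normalized sequence confidence over the audit verdict's tokens—looks like this (the function name and aggregation are assumptions for illustration):

```python
import math

def pds_from_logprobs(logprobs):
    """Aggregate audit-model token log-probabilities into a single
    stability signal in (0, 1]. Illustrative only: the paper's exact
    PDS formula may differ from this length-normalized mean."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    # Mean log-probability of the verdict tokens, mapped back to
    # probability space: geometric-mean per-token confidence.
    return math.exp(sum(logprobs) / len(logprobs))

# A confident audit verdict (logprobs near 0) yields PDS near 1;
# a hesitant verdict yields a lower signal.
confident = pds_from_logprobs([-0.01, -0.02, -0.05])
hesitant = pds_from_logprobs([-0.9, -1.2, -0.7])
```

The appeal of a signal like this is cost: it reuses logprobs the audit model already produces, so reasoning stability can be estimated without repeated sampling.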

Abstract

Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize valid decisions while mischaracterizing ambiguity as error -- a failure mode we term the Agreement Trap. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We harness LLM reasoning traces as a governance signal rather than a classification output by deploying the audit model not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model's false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. Together, these results show that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.
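The Governance Gate in the abstract combines these signals to decide which moderation decisions can be automated safely. The paper's gating rule and thresholds are not reproduced here, but a hedged sketch of the routing idea—automate only when a decision is both policy-defensible and stable, with placeholder thresholds—might look like:

```python
def governance_gate(di, pds, di_threshold=0.8, pds_threshold=0.9):
    """Route a moderation decision based on its Defensibility Index (DI)
    and Probabilistic Defensibility Signal (PDS). The thresholds and the
    AND-combination here are hypothetical placeholders, not the paper's
    calibrated values."""
    if di >= di_threshold and pds >= pds_threshold:
        return "automate"       # decision is defensible and stable
    return "human_review"       # ambiguous or unstable: escalate
```

Tuning the two thresholds trades automation coverage against residual risk, which is how a gate of this shape could reach figures like the reported 78.6% coverage with 64.9% risk reduction.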