Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication

arXiv cs.AI / 4/23/2026


Key Points

  • The paper addresses “factual presumptuousness” in AI systems—confidently deciding when evidence is incomplete—which is especially harmful in legal settings like unemployment insurance adjudication.
  • Using a collaboration with Colorado’s Department of Labor and Employment, the researchers create a benchmark that varies systematically in information completeness to test how AI behaves under missing evidence.
  • Evaluations of four leading AI platforms show that standard RAG approaches drop to about 15% accuracy when information is insufficient, while more advanced prompting improves accuracy on inconclusive cases but over-corrects, deferring even when the evidence is clear.
  • The authors propose SPEC (Structured Prompting for Evidence Checklists), a framework that forces explicit identification of missing information before any decision, achieving 89% overall accuracy and better deferral behavior when evidence is lacking.
  • The results suggest presumptuousness is a systematic but addressable failure mode in legal AI, and that mitigating it is a necessary step toward systems that reliably support, rather than replace, human judgment whenever decisions must await sufficient evidence.

Abstract

A well-known limitation of AI systems is presumptuousness: the tendency to provide confident answers even when information is lacking. This challenge is particularly acute in legal applications, where a core task for attorneys, judges, and administrators is to determine whether evidence is sufficient to reach a conclusion. We study this problem in the important setting of unemployment insurance adjudication, which has seen rapid integration of AI systems and where the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually. First, through a collaboration with the Colorado Department of Labor and Employment, we secure rare access to official training materials and guidance to design a novel benchmark that systematically varies in information completeness. Second, we evaluate four leading AI platforms and show that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient. Third, advanced prompting methods improve accuracy on inconclusive cases but over-correct, withholding decisions even on clear cases. Fourth, we introduce a structured framework requiring explicit identification of missing information before any determination (SPEC, Structured Prompting for Evidence Checklists). SPEC achieves 89% overall accuracy while appropriately deferring when evidence is insufficient, demonstrating that presumptuousness in legal AI is systematic but addressable, and that addressing it is a necessary step toward systems that reliably support, rather than supplant, human judgment wherever decisions must await sufficient evidence.
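The core idea behind SPEC, deferring until every required fact is accounted for, can be sketched as a simple gating step. The paper's actual prompts and checklist contents are not reproduced here; the names `EvidenceItem` and `spec_decide` below are hypothetical, and the checklist questions are illustrative placeholders, not Colorado's official criteria.

```python
# Hypothetical sketch of a SPEC-style evidence-checklist gate.
# All names and checklist questions are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class EvidenceItem:
    question: str  # fact the adjudicator must establish
    present: bool  # whether the case file answers it


def spec_decide(checklist: list[EvidenceItem]) -> str:
    """Refuse to issue a determination until every checklist item is answered."""
    missing = [item.question for item in checklist if not item.present]
    if missing:
        # Defer: name exactly what is missing rather than guessing.
        return "DEFER: request " + "; ".join(missing)
    return "DECIDE: evidence sufficient, proceed to determination"


# Example: an eligibility case missing the stated reason for separation.
checklist = [
    EvidenceItem("Was the claimant employed during the base period?", True),
    EvidenceItem("What was the stated reason for separation?", False),
]
print(spec_decide(checklist))
```

The design point is that the deferral check runs before any determination is generated, so a confident-but-unsupported answer cannot be emitted; this mirrors the paper's claim that forcing explicit identification of missing information is what curbs presumptuousness without over-correcting on clear cases.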