Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

arXiv cs.AI / April 21, 2026


Key Points

  • The paper argues that many LLM agent evaluations overlook the ability to diagnose *why* a task is blocked before acting, focusing instead on execution outcomes for fully specified tasks or studying related behaviors (clarification, abstention, capability awareness) in isolation.
  • It introduces the Support-State Triage Audit (SSTA-32), a matched-item counterfactual diagnostic framework that flips the same base request across four support states: Complete, Clarifiable, Support-Blocked, and Unsupported-Now.
  • Testing a frontier model under four prompting conditions shows that default “Direct” execution overcommits on non-complete tasks (41.7% overcommitment rate), while scalar confidence mapping reduces overcommitment but fails to distinguish among the three deferral types.
  • In contrast, both “Action-Only” and a typed Preflight Support Check (PSC) reach 91.7% typed deferral accuracy by making the categorical decision ontology explicit in the prompt.
  • Ablations show that removing the support-sufficiency dimension selectively degrades REQUEST SUPPORT accuracy, while removing the evidence-sufficiency dimension increases overcommitment on unsupported items; the authors note that DPAA’s single-context-window design means the results are upper-bound capability estimates.
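The two headline metrics above can be made concrete with a small sketch. This is not the paper’s released code; it is a minimal illustration assuming each audit item is a (support state, chosen action) pair, with “overcommitment” meaning the agent answered a non-Complete item and “typed deferral accuracy” meaning it chose the correct deferral type on non-Complete items.

```python
from enum import Enum

class Action(Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    REQUEST_SUPPORT = "request_support"
    ABSTAIN = "abstain"

# Hypothetical gold mapping, following the paper's four-way ontology:
# each support state has exactly one correct action.
GOLD = {
    "Complete": Action.ANSWER,
    "Clarifiable": Action.CLARIFY,
    "Support-Blocked": Action.REQUEST_SUPPORT,
    "Unsupported-Now": Action.ABSTAIN,
}

def overcommitment_rate(items):
    """Fraction of non-Complete items the agent answered anyway."""
    deferrals = [(s, a) for s, a in items if s != "Complete"]
    if not deferrals:
        return 0.0
    return sum(a is Action.ANSWER for _, a in deferrals) / len(deferrals)

def typed_deferral_accuracy(items):
    """Fraction of non-Complete items where the agent picked the
    correct deferral type (CLARIFY vs REQUEST SUPPORT vs ABSTAIN)."""
    deferrals = [(s, a) for s, a in items if s != "Complete"]
    if not deferrals:
        return 0.0
    return sum(a is GOLD[s] for s, a in deferrals) / len(deferrals)
```

On this framing, the Direct condition’s 41.7% overcommitment and Action-Only/PSC’s 91.7% typed deferral accuracy are just these two ratios computed over the audit set.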

Abstract

Current agent evaluations largely reward execution on fully specified tasks, while recent work studies clarification [11, 22, 2], capability awareness [9, 1], abstention [8, 14], and search termination [20, 5] mostly in isolation. This leaves open whether agents can diagnose why a task is blocked before acting. We introduce the Support-State Triage Audit (SSTA-32), a matched-item diagnostic framework in which minimal counterfactual edits flip the same base request across four support states: Complete (ANSWER), Clarifiable (CLARIFY), Support-Blocked (REQUEST SUPPORT), and Unsupported-Now (ABSTAIN). We evaluate a frontier model under four prompting conditions - Direct, Action-Only, Confidence-Only, and a typed Preflight Support Check (PSC) - using Dual-Persona Auto-Auditing (DPAA) with deterministic heuristic scoring. Default execution overcommits heavily on non-complete tasks (41.7% overcommitment rate). Scalar confidence mapping avoids overcommitment but collapses the three-way deferral space (58.3% typed deferral accuracy). Conversely, both Action-Only and PSC achieve 91.7% typed deferral accuracy by surfacing the categorical ontology in the prompt. Targeted ablations confirm that removing the support-sufficiency dimension selectively degrades REQUEST SUPPORT accuracy, while removing the evidence-sufficiency dimension triggers systematic overcommitment on unsupported items. Because DPAA operates within a single context window, these results represent upper-bound capability estimates; nonetheless, the structural findings indicate that frontier models possess strong latent triage capabilities that require explicit categorical decision paths to activate safely.
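The abstract’s “minimal counterfactual edits” idea can be sketched as follows. The concrete edit texts here are invented for illustration (the paper’s actual items are not reproduced); the point is the structure: one base request, four minimally edited variants, one per support state.

```python
# Hypothetical matched-item construction: the same base request is
# minimally edited so that exactly one support state holds per variant.
def matched_item_family(base_task: str) -> dict[str, str]:
    return {
        # All inputs and tools available: correct action is ANSWER.
        "Complete": f"{base_task} The relevant file is attached.",
        # Key parameter left ambiguous: correct action is CLARIFY.
        "Clarifiable": f"{base_task} (The time period is left unspecified.)",
        # A required tool is missing: correct action is REQUEST SUPPORT.
        "Support-Blocked": f"{base_task} No file-access tool is provided.",
        # The needed input does not exist yet: correct action is ABSTAIN.
        "Unsupported-Now": f"{base_task} The required data is not yet available.",
    }
```

Because the four variants share a base request, differences in the agent’s behavior across them can be attributed to the support-state edit rather than to task content, which is what makes the audit “matched-item.”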