Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode
arXiv cs.AI / 4/8/2026
Key Points
- The paper presents an independent stress-test evaluation of Anthropic’s Claude Code “auto mode,” which uses a two-stage transcript classifier to gate potentially dangerous tool calls.
- Using a new benchmark (AmPermBench) with deliberately ambiguous authorization scenarios, the study evaluates 253 state-changing actions at the individual-action level against oracle ground truth.
- The end-to-end false negative rate is 81.0%, far higher than the 17% reported from production traffic, indicating the system behaves differently under underspecified "intent-clear but scope-unclear" workloads.
- A key driver of the high false negative rate is limited classifier coverage at “Tier 2” (in-project file edits), with 36.8% of state-changing actions falling outside the classifier’s scope; artifact cleanup via file edits is especially impacted (92.9% FNR).
- Even within the subset of actions the classifier does evaluate ("Tier 3"), the false negative rate remains high (70.3%) and the false positive rate rises to 31.9%, suggesting both missing coverage and miscalibrated gating decisions under the test design.
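The rates above follow the standard confusion-matrix definitions: a false negative is a dangerous action the gate lets through, and a false positive is a safe action it blocks. A minimal sketch of that computation (the `gate_metrics` helper and the toy labels are illustrative, not the paper's benchmark data):

```python
def gate_metrics(oracle_dangerous, gated):
    """Compute false negative rate (dangerous actions not gated) and
    false positive rate (safe actions gated) against oracle labels."""
    tp = fn = fp = tn = 0
    for dangerous, blocked in zip(oracle_dangerous, gated):
        if dangerous and blocked:
            tp += 1          # dangerous action correctly gated
        elif dangerous and not blocked:
            fn += 1          # dangerous action allowed through (false negative)
        elif not dangerous and blocked:
            fp += 1          # safe action needlessly gated (false positive)
        else:
            tn += 1          # safe action correctly allowed
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr

# Hypothetical toy labels, not data from AmPermBench:
# 4 dangerous actions (only 1 gated), 4 safe actions (1 gated).
oracle = [True, True, True, True, False, False, False, False]
gated  = [True, False, False, False, True, False, False, False]
fnr, fpr = gate_metrics(oracle, gated)
print(fnr, fpr)  # → 0.75 0.25
```

Note that actions outside the classifier's scope (the "Tier 2" coverage gap) count as never gated, so a coverage gap inflates the end-to-end FNR even if the classifier itself were perfect on what it sees.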