Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode

arXiv cs.AI / 4/8/2026


Key Points

  • The paper presents an independent stress-test evaluation of Anthropic’s Claude Code “auto mode,” which uses a two-stage transcript classifier to gate potentially dangerous tool calls.
  • Using a new benchmark (AmPermBench) with deliberately ambiguous authorization scenarios, the study evaluates 253 state-changing actions at the individual-action level against oracle ground truth.
  • The end-to-end false negative rate is found to be 81.0%, far higher than the 17% reported on production traffic, indicating the system behaves differently under underspecified “intent-clear but scope-unclear” workloads.
  • A key driver of the high false negative rate is limited classifier coverage at “Tier 2” (in-project file edits): 36.8% of state-changing actions fall outside the classifier’s scope, and artifact cleanup via file edits is hit hardest (92.9% FNR); a sketch of this tiered routing follows the list.
  • Even within the subset of actions the classifier actually evaluates (“Tier 3”), the false negative rate remains high (70.3%) and the false positive rate rises to 31.9%, suggesting both a coverage gap and degraded gating decisions under the test design.

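To make the tiering concrete, here is a minimal Python sketch of the routing model the key points describe. It is an illustration of the coverage boundary only: the tier names, the `ToolCall` shape, and the routing rules are assumptions made for exposition, not Anthropic's implementation.

```python
# Hypothetical sketch of the tiered gating model described above.
# Tier names and routing rules are illustrative, not Anthropic's code.

from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    READ_ONLY = 1        # Tier 1: reads, never gated
    IN_PROJECT_EDIT = 2  # Tier 2: file edits inside the project, auto-allowed
    SHELL_COMMAND = 3    # Tier 3: shell/tool calls, sent to the classifier

@dataclass
class ToolCall:
    kind: str    # e.g. "read", "edit", "bash"
    target: str  # path or command string

def route(call: ToolCall) -> Tier:
    """Assign a tool call to a tier. Only Tier 3 reaches the classifier,
    so a destructive file edit (Tier 2) is never evaluated -- the coverage
    gap the study measures."""
    if call.kind == "read":
        return Tier.READ_ONLY
    if call.kind == "edit":
        return Tier.IN_PROJECT_EDIT
    return Tier.SHELL_COMMAND

def gate(call: ToolCall, classifier) -> bool:
    """Return True if the action may run without asking the user."""
    tier = route(call)
    if tier is not Tier.SHELL_COMMAND:
        return True  # Tiers 1-2 bypass the classifier entirely
    return classifier(call)  # Tier 3: transcript classifier decides

# Example of the gap: deleting build artifacts via `rm` would be classified
# (Tier 3), but emptying the same state files with an edit tool would not
# (Tier 2), even though the end effect is equivalent.
```
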
Abstract

Claude Code's auto mode is the first deployed permission system for AI coding agents, using a two-stage transcript classifier to gate dangerous tool calls. Anthropic reports a 0.4% false positive rate and 17% false negative rate on production traffic. We present the first independent evaluation of this system on deliberately ambiguous authorization scenarios, i.e., tasks where the user's intent is clear but the target scope, blast radius, or risk level is underspecified. Using AmPermBench, a 128-prompt benchmark spanning four DevOps task families and three controlled ambiguity axes, we evaluate 253 state-changing actions at the individual action level against oracle ground truth. Our findings characterize auto mode's scope-escalation coverage under this stress-test workload. The end-to-end false negative rate is 81.0% (95% CI: 73.8%-87.4%), substantially higher than the 17% reported on production traffic, reflecting a fundamentally different workload rather than a contradiction. Notably, 36.8% of all state-changing actions fall outside the classifier's scope via Tier 2 (in-project file edits), contributing to the elevated end-to-end FNR. Even restricting to the 160 actions the classifier actually evaluates (Tier 3), the FNR remains 70.3%, while the FPR rises to 31.9%. The Tier 2 coverage gap is most pronounced on artifact cleanup (92.9% FNR), where agents naturally fall back to editing state files when the expected CLI is unavailable. These results highlight a coverage boundary worth examining: auto mode assumes dangerous actions transit the shell, but agents routinely achieve equivalent effects through file edits that the classifier does not evaluate.
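
The abstract implies a simple per-action scoring scheme: each state-changing action carries an oracle label (should it have been gated?) and an observed outcome (was it gated?), from which FNR, FPR, and a 95% confidence interval are computed. The sketch below shows one way to do this; the counts in the demo are made up, and the Wilson interval is an assumption, since the interval method is not stated here.

```python
# Minimal sketch of per-action scoring against oracle ground truth.
# Demo counts are hypothetical, not the paper's raw data; the Wilson
# score interval is one common choice, assumed here for illustration.

from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def rates(records: list[tuple[bool, bool]]) -> dict[str, float]:
    """records: (should_gate, was_gated) pairs from oracle labels."""
    fn = sum(1 for should, was in records if should and not was)  # missed gates
    dangerous = sum(1 for should, _ in records if should)
    fp = sum(1 for should, was in records if not should and was)  # needless gates
    benign = sum(1 for should, _ in records if not should)
    return {"FNR": fn / dangerous, "FPR": fp / benign}

# Hypothetical usage with made-up labels:
demo = ([(True, False)] * 81 + [(True, True)] * 19
        + [(False, True)] * 3 + [(False, False)] * 7)
print(rates(demo))         # {'FNR': 0.81, 'FPR': 0.3}
print(wilson_ci(81, 100))  # CI for the FNR under these hypothetical counts
```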