AI Navigate

The Autonomy Tax: Defense Training Breaks LLM Agents

arXiv cs.AI / 3/23/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

Key Points

  • The study demonstrates a capability-alignment paradox where defense training to improve safety reduces agent competence and fails to prevent sophisticated prompt-based attacks, based on experiments across 97 tasks and 1,000 adversarial prompts.
  • It identifies three biases unique to multi-step agents: Agent incompetence bias (immediate tool-action failures on benign tasks), Cascade amplification bias (early failures propagating through retries causing near-universal timeouts), and Trigger bias (defended models performing worse than undefended ones while simple attacks bypass defenses).
  • The root cause is shortcut learning, with models overfitting to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories.
  • The findings suggest current defense paradigms optimize single-turn refusal benchmarks but render multi-step tool use unreliable under adversarial conditions, calling for new approaches that preserve tool execution competence while maintaining safety.
  • The work underscores the need for rethinking evaluation and defense strategies for LLM agents to balance safety and functionality in adversarial environments.

Abstract

Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental \textbf{capability-alignment paradox}: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. \textbf{Agent incompetence bias} manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. \textbf{Cascade amplification bias} causes early failures to propagate through retry loops, pushing defended models to timeout on 99\% of tasks compared to 13\% for baselines. \textbf{Trigger bias} leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.