Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents

arXiv cs.AI / 5/4/2026


Key Points

  • The paper challenges the assumption that tool-augmented reasoning in LLM agents always improves reliability, showing cases where it can underperform native chain-of-thought (CoT) when semantic distractors are present.
  • It introduces a Factorized Intervention Framework to separate the performance effects of prompt formatting cost, tool-calling protocol overhead, and the real benefit gained from executing tools.
  • The analysis identifies a “tool-use tax”: performance degradation caused specifically by the tool-calling protocol, which can outweigh tool benefits under semantic noise.
  • To reduce protocol-induced errors, the authors propose G-STEP, a lightweight inference-time gating mechanism that partially recovers performance.
  • The authors conclude that larger gains likely require strengthening the model’s intrinsic reasoning quality and its ability to interact with tools, not merely better prompting or more tool use.
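The three-factor decomposition described above can be sketched as a simple difference analysis. This is a hypothetical illustration, not the paper's code: the condition names, accuracies, and attribution scheme are all assumptions about how such a factorization could be computed.

```python
# Hypothetical sketch of a factorized intervention analysis (condition names
# and numbers are invented, not taken from the paper). Accuracy is measured
# under four evaluation conditions, and successive differences attribute the
# overall change to each factor.

def decompose(acc: dict) -> dict:
    """acc maps four assumed conditions to task accuracy:
      'cot'      - native chain-of-thought, plain prompt
      'fmt'      - CoT with tool-style prompt formatting, no tool schema
      'protocol' - tool-calling protocol enabled, tool outputs withheld
      'tools'    - full tool execution
    Returns each factor's contribution to the net change."""
    return {
        "formatting_cost": acc["fmt"] - acc["cot"],        # prompt formatting
        "tool_use_tax":    acc["protocol"] - acc["fmt"],   # protocol overhead
        "execution_gain":  acc["tools"] - acc["protocol"], # benefit of tools
        "net_change":      acc["tools"] - acc["cot"],
    }

# Illustrative (invented) numbers: tool execution helps, but the protocol
# tax outweighs the gain, so the net change versus native CoT is negative.
effects = decompose({"cot": 0.62, "fmt": 0.60, "protocol": 0.51, "tools": 0.58})
```

The four conditions form a chain, so the three factor terms telescope: they sum exactly to the net change between native CoT and full tool use.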

Abstract

Tool-augmented reasoning has become a popular direction for LLM-based agents, and it is widely assumed to improve reasoning and reliability. However, we demonstrate that this consensus does not always hold: in the presence of semantic distractors, tool-augmented reasoning does not necessarily outperform native CoT. To explain this performance gap, we propose a Factorized Intervention Framework that isolates the cost of prompt formatting, the overhead of the tool-calling protocol, and the actual gain from executing tools. Our analysis reveals a critical tradeoff: under semantic noise, the gains from tools often fail to offset the "tool-use tax", which is the performance degradation introduced by the tool-calling protocol itself. To address this, we introduce G-STEP, a lightweight inference-time gate to mitigate protocol-induced errors. While this yields partial recovery, our findings suggest that more substantial improvements still require strengthening the model's intrinsic reasoning and tool-interaction capabilities.
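An inference-time gate of the kind G-STEP describes could, in the abstract, look like the sketch below. This is a speculative illustration only: the paper's actual gating criterion is not specified here, so the confidence score, threshold, and routing logic are all assumptions.

```python
# Hypothetical sketch of an inference-time gate in the spirit of G-STEP
# (the actual mechanism in the paper may differ). The gate scores a proposed
# tool call and falls back to native chain-of-thought when the call looks
# unreliable, avoiding the protocol-induced "tool-use tax" on that step.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ToolCall:
    name: str
    args: dict
    confidence: float  # assumed model-assigned score for this call

def gated_step(
    tool_call: Optional[ToolCall],
    answer_with_cot: Callable[[], str],
    answer_with_tool: Callable[[ToolCall], str],
    threshold: float = 0.7,  # assumed gating threshold
) -> str:
    """Execute the tool only when the gate opens; otherwise use native CoT."""
    if tool_call is not None and tool_call.confidence >= threshold:
        return answer_with_tool(tool_call)  # accept protocol cost for the gain
    return answer_with_cot()                # skip the call, avoid the tax

# Usage: a low-confidence call is gated off, so the CoT path is taken.
result = gated_step(
    ToolCall("calculator", {"expr": "2+2"}, confidence=0.4),
    answer_with_cot=lambda: "cot-answer",
    answer_with_tool=lambda c: f"tool-answer:{c.name}",
)
```

Because the gate runs purely at inference time, it leaves the model's weights and the tool protocol untouched, which matches the paper's framing of G-STEP as lightweight and explains why it yields only partial recovery.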