To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

arXiv cs.AI / 5/4/2026

Key Points

  • The paper argues that agentic LLM architectures gain power from external tools, but tool calls can also be redundant or harmful, making the “call vs. don’t call” decision central to effective use.
  • It proposes a decision-theory-inspired framework to evaluate web search tool calls using three factors: necessity, utility, and affordability.
  • The framework compares two perspectives—normative (inferring true need/utility from optimal tool-call allocation) and descriptive (inferring the model’s self-perceived need/utility from observed behavior)—and shows these often diverge.
  • It introduces lightweight estimators trained from models' hidden states to predict need and utility, enabling simple controllers that improve tool-call decision quality and yield better task performance across multiple tasks and models (see the sketch after this list).
  • Overall, the approach provides a principled method and measurable mechanism for optimizing LLM tool-calling rather than relying on the model’s internal, self-assessed judgments alone.
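The paper trains lightweight estimators of need and utility from the model's hidden states. Below is a minimal sketch of what such estimators could look like, assuming simple linear probe heads over a frozen model's hidden state; the class and head names (`ToolCallProbe`, `need_head`, `utility_head`) and the 4096-dim hidden size are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class ToolCallProbe(nn.Module):
    """Lightweight probe over a frozen LLM's hidden state that estimates
    the need for, and utility of, a web search tool call."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # One linear head per factor; both read the same hidden state.
        self.need_head = nn.Linear(hidden_dim, 1)     # P(call is needed)
        self.utility_head = nn.Linear(hidden_dim, 1)  # expected gain of calling

    def forward(self, hidden_state: torch.Tensor):
        need = torch.sigmoid(self.need_head(hidden_state)).squeeze(-1)
        utility = self.utility_head(hidden_state).squeeze(-1)
        return need, utility

# Usage on a batch of pre-extracted hidden states (hypothetical 4096-dim model):
probe = ToolCallProbe(hidden_dim=4096)
states = torch.randn(8, 4096)   # stand-in for real hidden states
need, utility = probe(states)   # shapes: (8,), (8,)
```

Because the probes are small heads on top of cached hidden states, they can be trained cheaply without touching the underlying model's weights.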

Abstract

Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use therefore hinges on a core LLM decision: whether or not to call a tool when performing a task. This decision is particularly challenging for web search tools, where the benefits of external information depend on the model's internal knowledge and its ability to integrate potentially noisy tool responses. We introduce a principled framework inspired by decision-making theory to evaluate web search tool-use decisions along three key factors: necessity, utility, and affordability. Our analysis combines two complementary lenses: a normative perspective that infers true need and utility from an optimal allocation of tool calls, and a descriptive perspective that infers the model's self-perceived need and utility from its observed behavior. We find that models' perceived need and utility of tool calls are often misaligned with their true need and utility. Building on this framework, we train lightweight estimators of need and utility based on models' hidden states. Our estimators enable simple controllers that can improve decision quality and lead to stronger task performance than the self-perceived setup across three tasks and six models.
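The three factors suggest one natural way a controller could combine the estimates: an expected-value rule that issues the call only when need-weighted utility exceeds the call's cost (affordability). This is a minimal sketch under that assumption; the multiplicative rule and the function `should_call_tool` are illustrative, not the paper's exact controller.

```python
def should_call_tool(need: float, utility: float, cost: float) -> bool:
    """Expected-value rule (illustrative): call the web search tool only
    when the need-weighted utility of the call exceeds its cost."""
    return need * utility > cost

# High estimated need and utility with a cheap call -> issue the search.
print(should_call_tool(need=0.9, utility=0.7, cost=0.2))  # True
# Low need: the model likely already knows the answer -> skip the call.
print(should_call_tool(need=0.2, utility=0.7, cost=0.2))  # False
```

A rule like this makes the paper's central point concrete: the decision quality depends on how well `need` and `utility` are estimated, which is exactly where a model's self-perceived judgments can diverge from the normatively correct ones.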