When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

arXiv cs.CL / 4/10/2026



Key Points

The paper analyzes tool-integrated reasoning (TIR) in large reasoning models and finds that, when a model's own reasoning conflicts with a tool output, the model tends to side with its own reasoning. In some cases it disregards a correct tool result and ends up with a wrong answer, a failure mode the authors call “Tool Ignored”.
It argues that current tool-integrated models lack a reliable mechanism to decide when to trust or disregard tool execution outcomes.
To address this, the authors propose Adaptive Tool Trust Calibration (ATTC), which adaptively decides whether to trust tool results using the confidence score of generated code blocks.
Experiments across multiple open-source TIR models, dataset types, and model sizes show ATTC reduces the “Tool Ignored” failure mode and improves overall performance by 4.1% to 7.5%.
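
The summary above does not say how the confidence score of a code block is computed or how the trust decision is made from it. As a rough illustration only, the sketch below assumes the score is the geometric mean of the token probabilities in the generated code block and that the tool result is trusted when this score reaches a fixed threshold. The function names, the scoring rule, and the threshold value of 0.9 are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def code_block_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a generated code block.

    ASSUMPTION: this scoring rule is illustrative; the paper's exact
    definition of the confidence score is not given in this summary.
    """
    if not token_logprobs:
        raise ValueError("code block has no tokens")
    # exp(mean log p) equals the geometric mean of the token probabilities.
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_trust_tool(token_logprobs: Sequence[float],
                      threshold: float = 0.9) -> bool:
    """Trust the tool output only if the code producing it was generated
    with high confidence. The threshold is a hypothetical placeholder."""
    return code_block_confidence(token_logprobs) >= threshold


# Log-probabilities close to 0 mean the model was sure of each code token.
confident_block = [-0.01, -0.02, -0.03]
hesitant_block = [-0.5, -1.0, -0.3]

trust_confident = should_trust_tool(confident_block)
trust_hesitant = should_trust_tool(hesitant_block)
```

Under these assumptions, a model that wrote its code confidently would be steered toward accepting the execution result, while a low-confidence block would leave room for the model to fall back on its own reasoning.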

Abstract

Large reasoning models (LRMs) have achieved strong performance gains by scaling test-time computation, but due to the inherent limitations of the underlying language models, they still fall short on tasks that require precise computation and extensive knowledge. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool calls and their execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. There are also cases where the tool results are correct but are ignored by the model, resulting in incorrect answers, which we define as "Tool Ignored". This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, resulting in a performance increase of 4.1% to 7.5%.
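
To make the "Tool Ignored" definition concrete, the following sketch counts how often it occurs in a set of reasoning traces: the tool output matches the reference answer, yet the model's final answer does not. The `Trace` record, its field names, and the exact string comparison are assumptions made for illustration; a real evaluation of math answers would likely need a more tolerant equivalence check, and the paper's own measurement procedure may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trace:
    """One solved problem (hypothetical record layout)."""
    tool_output: str   # result returned by executing the generated code
    final_answer: str  # answer the model finally committed to
    gold_answer: str   # reference answer from the dataset


def _same(a: str, b: str) -> bool:
    # ASSUMPTION: whitespace-insensitive exact match stands in for a
    # proper mathematical equivalence check.
    return a.strip() == b.strip()


def is_tool_ignored(trace: Trace) -> bool:
    """True when the tool was right but the final answer is wrong."""
    return (_same(trace.tool_output, trace.gold_answer)
            and not _same(trace.final_answer, trace.gold_answer))


def tool_ignored_rate(traces: Sequence[Trace]) -> float:
    """Fraction of traces that fall into the "Tool Ignored" case."""
    if not traces:
        return 0.0
    return sum(is_tool_ignored(t) for t in traces) / len(traces)


traces = [
    Trace(tool_output="42", final_answer="42", gold_answer="42"),  # tool right, used
    Trace(tool_output="42", final_answer="41", gold_answer="42"),  # tool right, ignored
    Trace(tool_output="40", final_answer="42", gold_answer="42"),  # tool wrong, overridden
    Trace(tool_output="40", final_answer="40", gold_answer="42"),  # tool wrong, followed
]
rate = tool_ignored_rate(traces)
```

Only the second trace counts as "Tool Ignored" here. The third trace shows why ignoring the tool is not always a mistake, which is the reason a calibrated decision is needed in place of blanket trust.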

