Uncertainty Quantification for LLM Function-Calling

arXiv cs.CL · April 28, 2026

📰 News · Models & Research

Key Points

  • The paper studies how to apply Uncertainty Quantification (UQ) to LLM function-calling so the system can judge confidence before executing irreversible actions.
  • It reports what it claims is the first evaluation of UQ methods specifically for LLM Function-Calling, not just general question answering.
  • The authors find that multi-sample UQ approaches like Semantic Entropy do not provide clear benefits over simpler single-sample UQ methods in the function-calling setting.
  • They propose function-calling-specific improvements: clustering function-call outputs by abstract syntax tree (AST) structure for multi-sample methods, and using only semantically meaningful tokens to compute logit-based uncertainty for single-sample methods.
  • Overall, the work suggests that leveraging the structure of function-calling outputs can meaningfully improve confidence estimation and reduce the risk of incorrect tool use.
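The AST-based clustering idea from the key points can be illustrated with a small sketch. This is not the authors' implementation; it simply shows how parsing sampled function calls into abstract syntax trees lets formatting-only variants collapse into one cluster, over which a Semantic-Entropy-style score can then be computed. The sample calls and function names below are illustrative.

```python
# Hedged sketch: cluster sampled function calls by AST structure and
# compute an entropy over the resulting clusters, in the spirit of
# multi-sample UQ methods such as Semantic Entropy.
import ast
import math
from collections import Counter

def ast_key(call_str: str) -> str:
    """Canonical key for a function call via its AST dump.

    Calls that differ only in formatting (whitespace, quoting style)
    parse to the same tree and thus share a key.
    """
    tree = ast.parse(call_str, mode="eval")
    return ast.dump(tree.body)

def cluster_entropy(samples: list[str]) -> float:
    """Entropy over AST-equivalence clusters of sampled calls.

    Lower entropy -> samples agree on one call -> higher confidence.
    """
    clusters = Counter(ast_key(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

# Illustrative samples: two formatting variants of the same call,
# plus one structurally different call.
samples = [
    "transfer(amount=100, to='alice')",
    "transfer( amount = 100 , to = 'alice' )",
    "delete_account(user='bob')",
]
print(f"{cluster_entropy(samples):.4f}")
```

Clustering on the parsed tree rather than the raw string is what distinguishes this from naive exact-match voting: superficial token differences no longer split a cluster.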

Abstract

Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when the effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM's confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language Q&A tasks, we find that in the FC setting they offer no clear advantage over simple single-sample UQ methods. Additionally, we find that the particularities of FC outputs can be leveraged to improve the performance of existing UQ methods in this setting. Specifically, multi-sample UQ methods benefit from clustering FC outputs based on their abstract syntax tree parsing, while single-sample UQ methods can be improved by selecting only semantically meaningful tokens when calculating logit-based uncertainty scores.
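The single-sample improvement described in the abstract can also be sketched briefly. Again, this is not the paper's implementation: it assumes access to per-token log-probabilities from the model, and the token list, log-prob values, and the particular set of "syntax" tokens filtered out are all illustrative.

```python
# Hedged sketch: a logit-based confidence score computed only over
# semantically meaningful tokens of a function call, skipping pure
# syntax tokens ("(", ")", "=", ",", quotes) whose log-probs mostly
# add structural noise rather than signal.
import math

# Illustrative set of syntax-only tokens to exclude from the score.
SYNTAX_TOKENS = {"(", ")", ",", "=", "'", '"'}

def selective_confidence(tokens: list[str], logprobs: list[float]) -> float:
    """Geometric-mean probability over semantically meaningful tokens."""
    kept = [
        lp
        for tok, lp in zip(tokens, logprobs)
        if tok.strip() and tok.strip() not in SYNTAX_TOKENS
    ]
    if not kept:
        return 0.0
    return math.exp(sum(kept) / len(kept))

# Illustrative model output: syntax tokens carry low log-probs that
# would drag down a score averaged over all tokens.
tokens   = ["transfer", "(", "amount", "=", "100", ")"]
logprobs = [-0.1, -2.0, -0.2, -1.5, -0.3, -2.5]
print(f"{selective_confidence(tokens, logprobs):.4f}")
```

Averaging log-probs only over `transfer`, `amount`, and `100` yields exp(-0.2) ≈ 0.82, whereas averaging over all six tokens would report a much lower confidence driven largely by the punctuation.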