ReDAct: Uncertainty-Aware Deferral for LLM Agents

arXiv cs.CL / 4/9/2026


Key Points

  • The paper introduces ReDAct (Reason-Defer-Act), an LLM-agent method that reduces hallucination-driven errors in sequential decision tasks by deferring uncertain steps.
  • ReDAct uses two models: a small, low-cost LLM by default and a larger, more reliable (but more expensive) LLM only when the small model’s predictive uncertainty exceeds a calibrated threshold.
  • The authors evaluate the approach in text-based embodied environments (ALFWorld and MiniGrid), showing that deferring roughly 15% of decisions to the large model can nearly match the quality of always using the large model.
  • The results indicate substantial inference cost savings while preserving decision quality, addressing the common tradeoff between reliability and per-token expense in larger LLMs.
  • The approach relies on uncertainty estimation and threshold calibration to decide when the agent should “defer” its reasoning/acting to a stronger model.
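The defer-or-act rule in the key points above can be sketched in a few lines. Note this is an illustrative sketch only: the entropy-based uncertainty measure, the action-distribution interface, and the threshold value are assumptions, not the paper's exact formulation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of an action distribution,
    used here as a stand-in for predictive uncertainty."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def choose_action(small_probs, large_probs, threshold):
    """Act with the small model's distribution by default;
    defer to the large model's distribution when the small
    model's uncertainty exceeds the calibrated threshold."""
    if entropy(small_probs) > threshold:
        probs = large_probs  # defer: small model is uncertain
    else:
        probs = small_probs  # default: cheap small model
    return max(probs, key=probs.get)

# Hypothetical action distributions over text-game actions:
confident = {"go north": 0.90, "go south": 0.05, "open door": 0.05}
uncertain = {"go north": 0.40, "go south": 0.35, "open door": 0.25}
large     = {"open door": 0.80, "go north": 0.10, "go south": 0.10}

print(choose_action(confident, large, threshold=0.7))  # small model acts
print(choose_action(uncertain, large, threshold=0.7))  # deferred to large
```

In the confident case the small model's entropy is low, so its own top action is taken; in the uncertain case the entropy crosses the threshold and the large model's choice wins.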

Abstract

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per-token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason-Defer-Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text-based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.
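The abstract mentions a calibrated threshold without detailing how it is chosen. One plausible sketch (an assumption on my part, not the paper's stated procedure) is a quantile rule: collect uncertainty scores on a calibration set and pick the threshold so that roughly the desired fraction of steps, about 15% in the paper's results, would be deferred.

```python
def calibrate_threshold(uncertainties, defer_fraction=0.15):
    """Return a threshold such that roughly `defer_fraction`
    of the calibration scores exceed it (simple quantile rule;
    the paper's actual calibration method may differ)."""
    ranked = sorted(uncertainties)
    # Index of the (1 - defer_fraction) quantile of the scores.
    idx = int((1 - defer_fraction) * (len(ranked) - 1))
    return ranked[idx]

# Hypothetical per-step uncertainty scores from a calibration run:
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tau = calibrate_threshold(scores, defer_fraction=0.2)
deferred = sum(s > tau for s in scores) / len(scores)
print(tau, deferred)  # threshold and realized deferral rate
```

At inference time, any step whose uncertainty exceeds `tau` is routed to the large model, so the deferral rate on new trajectories should track the calibration-set fraction.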