TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

arXiv cs.CL / 4/24/2026

📰 News · Models & Research

Key Points

  • The paper introduces TRACES, a lightweight framework that tags language reasoning model (LRM) steps in real time to enable adaptive, cost-efficient early stopping during inference.
  • By monitoring how different types of reasoning steps behave—especially after a correct answer is reached—the authors identify interpretable signals for when the model can stop generating.
  • The study finds that LRMs often change their reasoning behavior once they have produced a correct answer, suggesting opportunities to reduce unnecessary verification/reflection.
  • Experiments on the mathematical benchmarks MATH500, GSM8K, and AIME, plus the knowledge/reasoning benchmarks MMLU and GPQA, show 20–50% token reductions while preserving accuracy comparable to standard full generation.
  • The approach targets inefficiency from over-generated reasoning steps, an issue that remains underexplored at the level of individual step types and their contribution to correctness.

Abstract

The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to generating correct answers, are largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inference. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific step types can yield effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks, namely MATH500, GSM8K, and AIME, and two knowledge and reasoning benchmarks, MMLU and GPQA. We achieve 20–50% token reduction while maintaining accuracy comparable to standard generation.
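The paper does not publish its tagging or stopping logic in this summary, but the core idea (tag each step, then stop once the step stream shifts to verification/reflection after an answer candidate appears) can be sketched as follows. Everything here is a hypothetical illustration: the step types, keyword cues, `patience` parameter, and function names are assumptions, not the authors' implementation, which presumably uses a learned real-time tagger rather than keyword matching.

```python
# Hypothetical sketch of tag-based early stopping (NOT the TRACES
# implementation): tag each reasoning step with a coarse type, then
# stop once an answer candidate exists and the recent steps are all
# verification/reflection -- the "behavioral shift" signal.
import re

# Illustrative step types and keyword cues (assumed, not from the paper).
STEP_CUES = {
    "verification": ("check", "verify", "confirm"),
    "reflection": ("wait", "alternatively", "re-examine"),
}

def tag_step(step: str) -> str:
    """Tag a single reasoning step by keyword cue; default to 'derivation'."""
    s = step.lower()
    for step_type, cues in STEP_CUES.items():
        if any(cue in s for cue in cues):
            return step_type
    return "derivation"

def should_stop(steps: list[str], patience: int = 2) -> bool:
    """Stop once an answer candidate has appeared and the last
    `patience` steps are all verification/reflection."""
    has_answer = any(re.search(r"answer\s*(is|:)", s, re.I) for s in steps)
    if not has_answer or len(steps) < patience:
        return False
    tail = [tag_step(s) for s in steps[-patience:]]
    return all(t in ("verification", "reflection") for t in tail)

steps = [
    "Compute 12 * 7 = 84.",
    "So the answer is 84.",
    "Let me verify: 12 * 7 = 84.",
    "Wait, let me re-examine the multiplication.",
]
print(should_stop(steps))  # True: answer reached, then only verification/reflection
```

In a real decoding loop, `should_stop` would be called after each newly generated step; truncating generation at that point is what yields the reported token savings without touching steps that still contribute to the answer.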