Large Language Models Decide Early and Explain Later

arXiv cs.CL / April 27, 2026

Key Points

  • The study examines whether large language model answers are already determined early during chain-of-thought generation, and whether later reasoning becomes “post-decision” explanation that adds cost without improving correctness.
  • Using forced answer completion on Qwen3-4B across multiple datasets, the authors find that the predicted final answer changes in only 32% of queries, and that after the last answer switch the model still generates about 760 extra reasoning tokens on average (a minimal sketch of the probing setup follows this list).
  • The results point to substantial redundancy in chain-of-thought generation: much of the later reasoning may never alter the final answer.
  • The paper proposes early-stopping strategies (including probe-based stopping) that halt generation once the answer stabilizes, cutting reasoning tokens by about 500 per query at the cost of only a ~2% drop in accuracy (sketched after the abstract below).
  • Overall, the work motivates inference-time techniques to cut latency and inference cost by stopping redundant reasoning while largely preserving performance.
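
A minimal sketch of the probing setup helps make the forced-answer-completion idea concrete. The Python sketch below assumes a Hugging Face-style `transformers` API; the forcing phrase, greedy decoding, and 64-token probe granularity are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of forced answer completion: truncate the chain of thought
# at growing prefixes and force the model to commit to an answer at each cut.
# Assumptions (not from the paper): forcing phrase, greedy decoding, step size.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-4B"  # the model studied in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

FORCE_PHRASE = "\n\nFinal answer:"  # assumed forcing string

def forced_answer(question: str, reasoning_prefix: str,
                  max_new_tokens: int = 8) -> str:
    """Elicit the model's current predicted answer at a partial reasoning prefix."""
    prompt = question + "\n" + reasoning_prefix + FORCE_PHRASE
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

def answer_trajectory(question: str, full_reasoning: str, step: int = 64):
    """Probe the predicted answer every `step` reasoning tokens."""
    ids = tokenizer(full_reasoning, add_special_tokens=False)["input_ids"]
    points = []
    for cut in range(step, len(ids) + step, step):
        prefix = tokenizer.decode(ids[:cut])
        points.append((min(cut, len(ids)), forced_answer(question, prefix)))
    return points  # e.g. [(64, "42"), (128, "42"), ...]
```

In this framing, a query whose trajectory stops switching early is one where the answer was decided early, and the remaining reasoning tokens are post-decision explanation.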

Abstract

Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
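
The early-stopping heuristic can be read as a small loop on top of the same probe. The sketch below reuses `forced_answer`, `model`, and `tokenizer` from the sketch above; the chunk size and patience threshold are assumed hyperparameters, not values reported in the paper, and the string-equality check stands in for whatever stabilization test the authors actually use.

```python
# Minimal sketch of stability-based early stopping: generate reasoning in
# chunks, probe the forced answer after each chunk, and stop once the answer
# has been unchanged for `patience` consecutive probes.
def continue_reasoning(question: str, reasoning: str, n_tokens: int) -> str:
    """Extend the chain of thought by up to n_tokens (hypothetical helper)."""
    prompt = question + "\n" + reasoning
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def answer_with_early_stop(question: str, chunk: int = 64,
                           patience: int = 3, max_chunks: int = 32) -> str:
    reasoning, last_answer, stable = "", None, 0
    for _ in range(max_chunks):
        reasoning += continue_reasoning(question, reasoning, chunk)
        answer = forced_answer(question, reasoning)
        stable = stable + 1 if answer == last_answer else 1
        last_answer = answer
        if stable >= patience:  # answer stabilized: stop reasoning here
            break
    return last_answer
```

The paper's probe-based variant presumably replaces this repeated-generation check with a lightweight probe, which would be cheaper; either way, the tradeoff is the roughly 500 saved reasoning tokens per query against the roughly 2% accuracy drop the authors report.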