FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

arXiv cs.AI / 4/6/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper finds a counterintuitive pattern in large reasoning models (LRMs): the first generated solution is often the best, while later alternative solutions can be actively harmful rather than merely worse.
  • It challenges common test-time scaling assumptions by proposing that errors along the reasoning path scale with test-time compute, modeled as a forest-structured “Forest of Errors” (FoE).
  • Based on these insights, the authors introduce RED, a self-guided efficient reasoning framework that both refines the first solution and prunes subsequent reasoning using a dual-consistency approach.
  • Experiments across five benchmarks and six backbone models show RED improves performance by up to 19.0% while cutting token usage by roughly 37.7%–70.4%, outperforming eight baselines.
  • FoE-related diagnostic experiments are used to explain how and why RED reduces the growth of harmful alternative-solution errors.

Abstract

Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% ~ 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.
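The two-component control flow the abstract describes, refine the first solution, then prune later ones, can be sketched as a decoding loop. This is a hypothetical illustration only: the function names (`generate`, `refine`, `extract_answer`) and the specific consistency checks are assumptions standing in for the paper's actual dual-consistency criteria, which are not specified here.

```python
# Hypothetical sketch of RED-style decoding (Refining First + Discarding Subs).
# All names and the concrete dual-consistency test below are illustrative
# assumptions, not the paper's actual implementation.

def red_decode(generate, refine, extract_answer, prompt, max_solutions=4):
    """Return (final_answer, kept_solutions) under a first-is-best policy.

    generate(prompt, i) -> str  : samples the i-th candidate solution text
    refine(text) -> str         : one self-guided refinement pass
    extract_answer(text) -> str : pulls the final answer from a solution
    """
    # I) Refining First: polish the first solution to suppress FoE growth early.
    first = refine(generate(prompt, 0))
    first_answer = extract_answer(first)

    # II) Discarding Subs: keep a later solution only if it passes a
    # dual-consistency check (here: same final answer AND non-empty
    # reasoning, as a placeholder for the paper's two criteria).
    kept = [first]
    for i in range(1, max_solutions):
        sub = generate(prompt, i)
        answer_consistent = extract_answer(sub) == first_answer
        reasoning_consistent = len(sub.strip()) > 0  # placeholder check
        if answer_consistent and reasoning_consistent:
            kept.append(sub)
        else:
            # Prune the divergent branch and stop spending tokens on it;
            # this is where the token savings the paper reports would arise.
            break

    return first_answer, kept
```

With toy stubs for the three callables, the loop keeps consistent alternatives and stops at the first divergent one, always reporting the (refined) first solution's answer.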