When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

arXiv cs.AI / April 30, 2026


Key Points

  • The paper targets two weaknesses: large reasoning models (LRMs) remain unreliable on difficult mathematical instances, and existing test-time scaling methods often show diminishing returns on exactly those instances.
  • It finds that disagreement among outputs is a strong indicator of instance difficulty and prediction correctness, enabling more informed strategy choice at test time.
  • The authors propose a training-free, instance-level routing framework that selects among scaling strategies per input rather than uniformly spending more computation on every case.
  • The method uses lightweight resolution for consistent outputs, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous cases.
  • Experiments across seven math benchmarks and three models show 3%–7% accuracy gains while lowering sampling cost versus prior approaches.
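The key points above hinge on measuring disagreement among sampled outputs. A minimal way to quantify this is the fraction of sampled final answers that differ from the modal answer; note this specific formula and the function name are illustrative assumptions, not the paper's stated metric.

```python
from collections import Counter

def disagreement(answers):
    """Fraction of sampled final answers that differ from the modal answer.

    0.0 means all samples agree (an "easy" instance under the paper's
    observation); values near 1.0 mean the samples scatter widely.
    This formula is an illustrative choice, not the paper's exact metric.
    """
    counts = Counter(answers)
    modal_count = counts.most_common(1)[0][1]
    return 1.0 - modal_count / len(answers)

print(disagreement(["42", "42", "42", "42"]))  # 0.0: fully consistent
print(disagreement(["42", "42", "17", "9"]))   # 0.5: half disagree with the mode
```

Under the paper's observation, a low score would flag a likely-correct easy instance, while a high score would flag a hard, ambiguous one.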

Abstract

Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem: rather than allocating more computation within a single strategy, it dynamically selects among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3%–7% while reducing sampling cost compared to existing approaches.
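The three-way routing rule the abstract describes can be sketched as follows. This is a hedged illustration: the threshold values, the disagreement formula, and the strategy names are assumptions made here for concreteness, since the abstract does not specify them.

```python
from collections import Counter

# Illustrative cutoffs; the paper's actual thresholds are not given here.
LOW_DISAGREEMENT = 0.2
HIGH_DISAGREEMENT = 0.6

def route(answers):
    """Pick a test-time scaling strategy from disagreement among sampled answers.

    Low disagreement  -> lightweight resolution (accept the consistent answer cheaply)
    Moderate          -> majority voting over the samples
    High              -> rewrite the problem and resample (ambiguous instance)
    """
    modal_count = Counter(answers).most_common(1)[0][1]
    d = 1.0 - modal_count / len(answers)  # fraction of non-modal answers
    if d <= LOW_DISAGREEMENT:
        return "lightweight_resolution"
    if d <= HIGH_DISAGREEMENT:
        return "majority_voting"
    return "rewrite_and_resample"

print(route(["42"] * 8))                  # all agree -> lightweight_resolution
print(route(["42", "42", "42", "17"]))    # d = 0.25  -> majority_voting
print(route(["a", "b", "c", "d"]))        # d = 0.75  -> rewrite_and_resample
```

The design point the paper makes is that this routing happens per instance, so the expensive strategies are spent only where the cheap ones are likely to fail.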