Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

arXiv cs.CL · April 24, 2026


Key Points

  • The paper proposes “Verbal Process Supervision (VPS),” a training-free inference-time framework that improves LLM reasoning by iteratively generating, critiquing, and refining outputs using structured natural-language critique from a stronger supervisor.
  • VPS introduces a new scaling axis—critique granularity—alongside existing approaches like deeper chains, wider sampling, and learned step scorers (PRMs).
  • On GPQA Diamond, VPS lets a GPT-5.4 (High) | GPT-5.4 (Low) pairing reach 94.9% at round budget R=4, beating the prior state of the art (94.1%) without gradient updates.
  • On AIME 2025, VPS achieves “weak-actor rescue,” dramatically raising performance from 11.7–26.7% to 63.3–90.0% by guiding weaker models using verbal critique.
  • Across GPQA and LiveCodeBench V6, VPS outperforms methods such as Reflexion and Self-Consistency at matched compute, with gains correlating strongly with the supervisor–actor capability gap (Pearson r=0.90) and degrading when errors cannot be expressed linguistically (e.g., code synthesis).

Abstract

Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7–26.7% to 63.3–90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor–actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal–executable methods. These results establish critique granularity as a new axis of inference-time scaling.
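
To make the core mechanism concrete, here is a minimal Python sketch of the generate-critique-refine loop under a round budget R. The `actor` and `supervisor` callables, the "ACCEPT" stopping convention, and the stub demo are hypothetical illustrations of the procedure described in the abstract, not the paper's actual prompts or implementation.

```python
from typing import Callable, Optional

def verbal_process_supervision(
    problem: str,
    actor: Callable[..., str],       # weaker model: drafts and revises answers
    supervisor: Callable[..., str],  # stronger model: emits verbal critiques
    round_budget: int = 4,           # the round budget R from the paper
) -> str:
    """Training-free generate-critique-refine loop (a sketch of VPS)."""
    answer = actor(problem)  # initial attempt, no supervision yet
    for _ in range(round_budget):
        # Supervisor inspects the current answer and returns a structured
        # natural-language critique (or an acceptance signal).
        critique = supervisor(problem, answer)
        if critique.strip().upper().startswith("ACCEPT"):
            break  # supervisor found no actionable errors
        # Actor revises conditioned on the critique; no gradient updates occur.
        answer = actor(problem, feedback=critique)
    return answer

if __name__ == "__main__":
    # Toy demo with stub models: the "supervisor" accepts on the second round.
    state = {"round": 0}

    def actor(problem: str, feedback: Optional[str] = None) -> str:
        return "revised answer" if feedback else "first attempt"

    def supervisor(problem: str, answer: str) -> str:
        state["round"] += 1
        return "ACCEPT" if state["round"] > 1 else "Step 2 is wrong; recheck it."

    print(verbal_process_supervision("Prove X.", actor, supervisor, round_budget=4))
```

In practice the two callables would wrap API calls to a stronger supervisor model and a weaker actor model; per the paper's correlation result, the gain from this loop should track the capability gap between them.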