One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

arXiv cs.CL / 4/29/2026


Key Points

  • The paper introduces ReQueR, a framework that reframes reasoning elicitation for LLMs as an inference-time alignment problem by rewriting user queries into explicit logical decompositions.
  • Instead of fine-tuning many models or relying on static prompts, ReQueR trains a dedicated “Refiner” policy with reinforcement learning while keeping the target LLMs frozen as the environment (see the sketch after this list).
  • Drawing on the Zone of Proximal Development from educational psychology, the authors propose an Adaptive Solver Hierarchy, a curriculum that matches environment difficulty to the Refiner’s growing competence in order to stabilize training.
  • Experiments show consistent absolute gains of 1.7%–7.2% across multiple architectures and benchmarks, averaging a 2.1% improvement over strong baselines.
  • The method targets one-to-many inference-time reasoning elicitation, suggesting that a single Refiner trained on a small set of models can generalize to unlock reasoning in diverse unseen LLMs, with code released on GitHub.
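
To make the training setup concrete, here is a minimal sketch of the refine–solve–reward loop the key points describe. All names (`refine`, `solve`, `reward`, `training_step`) and the stubbed behavior are hypothetical stand-ins, not the authors' released code: a real run would sample rewrites from a trainable language model and feed the reward into a policy-gradient update of the Refiner alone.

```python
# Hedged sketch of the ReQueR-style interaction loop: a trainable Refiner
# rewrites a raw query, a frozen solver LLM answers it, and answer
# correctness becomes the Refiner's reward. Function names and stub
# behavior are illustrative assumptions, not the paper's implementation.
import random

def refine(query: str) -> str:
    """Hypothetical Refiner policy: rewrite a raw query into an explicit
    logical decomposition (in practice, sampled from a trained LM)."""
    return ("Decompose before answering:\n"
            "1. Identify the quantities involved.\n"
            "2. Combine them step by step.\n"
            f"Question: {query}")

def solve(refined_query: str) -> str:
    """Stand-in for a frozen solver LLM. It is part of the environment:
    its weights are never updated. Stubbed with a coin flip here."""
    return random.choice(["42", "7"])

def reward(answer: str, gold: str) -> float:
    """Binary correctness reward used to update the Refiner policy."""
    return 1.0 if answer.strip() == gold else 0.0

def training_step(query: str, gold: str) -> float:
    """One environment interaction: refine -> solve -> score. The returned
    reward would drive a policy-gradient update of the Refiner only."""
    return reward(solve(refine(query)), gold)

if __name__ == "__main__":
    print("reward =", training_step("What is 6 times 7?", gold="42"))
```

Because only the Refiner is updated, the same trained policy can in principle be dropped in front of other frozen models at inference time, which is what makes the one-to-many claim possible.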

Abstract

Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive O(N) costs by fine-tuning each model individually or rely on static prompts that fail to resolve query-level structural complexity. In this paper, we propose ReQueR (**Re**inforcement **Que**ry **R**efinement), a modular framework that treats reasoning elicitation as an inference-time alignment task. We train a specialized Refiner policy via Reinforcement Learning to rewrite raw queries into explicit logical decompositions, treating frozen LLMs as the environment. Rooted in the classical Zone of Proximal Development from educational psychology, we introduce the Adaptive Solver Hierarchy, a curriculum mechanism that stabilizes training by dynamically aligning environmental difficulty with the Refiner's evolving competence. ReQueR yields consistent absolute gains of 1.7%–7.2% across diverse architectures and benchmarks, outperforming strong baselines by 2.1% on average. Crucially, it provides a promising paradigm for one-to-many inference-time reasoning elicitation, enabling a single Refiner trained on a small set of models to effectively unlock reasoning in diverse unseen models. Code is available at https://github.com/newera-xiao/ReQueR.
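
The abstract describes the Adaptive Solver Hierarchy only at a high level, but the ZPD intuition, keeping the Refiner training against solvers that are neither trivially easy nor hopelessly hard, can be sketched as a success-rate-driven curriculum. The rolling window, the 0.4–0.8 target band, and the promote/demote rule below are illustrative assumptions, not the paper's actual mechanism.

```python
# Hedged sketch of a ZPD-style curriculum over a hierarchy of solver LLMs,
# in the spirit of the paper's Adaptive Solver Hierarchy. The thresholds
# and the promotion rule are assumptions chosen for illustration.
from collections import deque

class AdaptiveSolverCurriculum:
    def __init__(self, solvers, window=50, low=0.4, high=0.8):
        self.solvers = solvers              # assumed ordered easiest -> hardest
        self.level = 0                      # start against the easiest solver
        self.recent = deque(maxlen=window)  # rolling reward history
        self.low, self.high = low, high     # target success-rate band (assumed)

    def current_solver(self):
        return self.solvers[self.level]

    def update(self, reward: float) -> None:
        """Record one episode's reward and shift difficulty so the Refiner
        keeps training just beyond its current competence (its 'ZPD')."""
        self.recent.append(reward)
        if len(self.recent) < self.recent.maxlen:
            return                          # wait until the window fills
        rate = sum(self.recent) / len(self.recent)
        if rate > self.high and self.level < len(self.solvers) - 1:
            self.level += 1                 # too easy: move to a harder solver
            self.recent.clear()
        elif rate < self.low and self.level > 0:
            self.level -= 1                 # too hard: fall back to an easier one
            self.recent.clear()

if __name__ == "__main__":
    cur = AdaptiveSolverCurriculum(
        ["easy-solver", "medium-solver", "hard-solver"], window=10)
    for _ in range(30):
        cur.update(1.0)                     # pretend the Refiner keeps winning
    print(cur.current_solver())             # promoted toward harder solvers
```

The design choice here mirrors the abstract's stability argument: if the environment stays too easy the policy gradient carries little signal, and if it is too hard rewards are uniformly zero, so the curriculum keeps the Refiner inside an informative reward regime.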