RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution

arXiv cs.CL · March 24, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates whether RLVR (reinforcement learning from verifiable rewards) that improves reasoning on verifiable tasks transfers to general question answering (GQA), finding that it does not reliably boost GQA performance.
  • It introduces a Cross-Generation evaluation framework that compares intermediate reasoning quality by passing generated “thinking” contexts into LLMs with different capabilities.
  • The evaluation shows that the reasoning process helps less on GQA than on verifiable tasks, implying that models may learn reasoning shortcuts that still score well on reward-driven verifiable setups.
  • The authors also observe that direct RL on GQA is less effective than RLVR, and they hypothesize that GQA reward structures can be satisfied via shortcuts rather than high-quality reasoning.
  • To address this, the paper proposes START (Separated Thinking And Response Training), which first trains only the thinking process, using rewards defined on the final answer, and then trains the response separately; this improves both intermediate thinking quality and final answers across multiple GQA benchmarks and RL algorithms.
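
The Cross-Generation idea in the second bullet can be sketched in a few lines: score a generated thinking trace by how much it lifts answer accuracy when handed to independent reader models of varying capability. This is a minimal illustrative sketch, not the paper's implementation; the names `cross_generation_score` and `readers`, and the toy reader models below, are all hypothetical.

```python
def cross_generation_score(thinking, question, gold, readers):
    """Score a generated `thinking` trace by the mean accuracy uplift
    it gives to independent reader models.

    readers: list of callables (question, context) -> answer, meant to
             span different capability levels.
    Returns (accuracy with the trace) - (accuracy without it).
    """
    with_trace = sum(reader(question, thinking) == gold for reader in readers)
    without = sum(reader(question, "") == gold for reader in readers)
    return (with_trace - without) / len(readers)


# Toy readers: a weak one that answers correctly only when the trace
# supplies the key fact, and a strong one that already knows the answer.
weak = lambda q, ctx: "Paris" if "capital of France is Paris" in ctx else "Lyon"
strong = lambda q, ctx: "Paris"

trace = "The capital of France is Paris."
score = cross_generation_score(trace, "What is the capital of France?",
                               "Paris", [weak, strong])
# Here the trace rescues the weak reader but not the strong one,
# so the uplift is 0.5.
```

A helpful trace yields a positive uplift even for weaker readers; a trace that only works as a private scratchpad for the model that wrote it scores near zero, which is the failure mode the paper reports on GQA.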

Abstract

Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate reasoning by feeding the generated thinking context into LLMs of varying capabilities. Our evaluation leads to a discouraging finding: the efficacy of the thinking process on GQA tasks is markedly lower than on verifiable tasks, suggesting that explicit training on GQA remains necessary in addition to training on verifiable tasks. We further observe that direct RL training on GQA is less effective than RLVR. Our hypothesis is that, whereas verifiable tasks demand robust logical chains to obtain high rewards, GQA tasks often admit shortcuts to high rewards without cultivating high-quality thinking. To avoid possible shortcuts, we introduce a simple method, Separated Thinking And Response Training (START), which first trains only the thinking process, using rewards defined on the final answer. We show that START improves both the quality of thinking and the final answer across several GQA benchmarks and RL algorithms.
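
START's core move, rewarding only the thinking stage through the correctness of a separately produced final answer, can be sketched as a reward function. This is an illustrative sketch under my own reading of the abstract; `start_thinking_reward` and the stub `respond` step are hypothetical names, not the authors' code.

```python
def start_thinking_reward(question, thinking, gold, respond):
    """Reward for the thinking stage alone.

    A separate response step `respond` (a callable
    (question, thinking) -> answer, held fixed during this stage)
    turns the trace into a final answer; the thinking policy is
    rewarded only by that answer's correctness. Because the response
    text itself is not being optimized here, the policy cannot earn
    reward through answer-side shortcuts.
    """
    answer = respond(question, thinking)
    return 1.0 if answer.strip() == gold.strip() else 0.0


# Toy response step: read the final token of the trace as the answer.
respond = lambda q, thinking: thinking.split()[-1] if thinking else ""

r = start_thinking_reward("What is 2 + 2?", "Add 2 and 2 to get 4", "4", respond)
# The trace ends in the correct answer, so r == 1.0.
```

The reward is still defined on the final answer, as in ordinary RL on GQA, but gradients flow only into the thinking stage, which is what the paper argues steers optimization toward genuinely useful intermediate reasoning.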