RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution
arXiv cs.CL / 3/24/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper investigates whether RLVR (reinforcement learning with verifiable rewards), which improves reasoning on verifiable tasks, transfers to general question answering (GQA), and finds that it does not reliably boost GQA performance.
- It introduces a Cross-Generation evaluation framework that compares intermediate reasoning quality by passing generated “thinking” contexts into LLMs with different capabilities.
- The evaluation shows that the reasoning process helps less on GQA than on verifiable tasks, implying that models may learn reasoning shortcuts that still score well on reward-driven verifiable setups.
- The authors also observe that direct RL on GQA is less effective than RLVR, and they hypothesize that GQA reward structures can be satisfied via shortcuts rather than high-quality reasoning.
- To address this, the paper proposes START (Separated Thinking And Response Training), which first trains the thinking module and then, separately, the response module, with rewards defined on the final answer; this improves both intermediate thinking quality and final answers across multiple GQA benchmarks and RL algorithms.
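The Cross-Generation idea above can be made concrete: generate a "thinking" trace with one model, inject it into a different model's context, and measure the accuracy lift it produces. The sketch below is illustrative only, assuming this basic generate-and-transfer loop; `reasoner` and `responder` are toy stand-ins for real LLMs, not the paper's implementation.

```python
# Hedged sketch of a Cross-Generation-style evaluation loop.
# `reasoner` and `responder` stand in for two LLMs of different
# capability; here they are toy stubs so the flow is runnable.

def reasoner(question):
    # Stand-in for the RLVR-trained model's "thinking" trace.
    return f"Think: break '{question}' into known facts, then conclude."

def responder(question, thinking=None):
    # Stand-in for a separate model that answers, optionally
    # conditioned on the other model's reasoning trace.
    # Toy heuristic: a thinking-augmented prompt "scores" higher.
    return 1.0 if thinking else 0.5

def thinking_gain(questions):
    """Average accuracy lift from injecting cross-model thinking."""
    base = sum(responder(q) for q in questions) / len(questions)
    aided = sum(responder(q, reasoner(q)) for q in questions) / len(questions)
    return aided - base

gain = thinking_gain(["What causes tides?", "Who wrote Hamlet?"])
```

On GQA, the paper's finding corresponds to this `gain` being small: the transplanted thinking helps other models less than it does on verifiable tasks.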
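The separation in START can likewise be sketched as a two-stage loop: the reward is always computed from the final answer, but stage 1 updates only the thinking policy while stage 2 updates only the response policy. All names below are illustrative assumptions, not the paper's actual training code.

```python
# Hedged sketch of START-style two-stage training, assuming the core
# idea is: one final-answer reward, but the thinking and response
# policies are optimized in separate stages.

def answer_reward(answer, gold):
    # Verifiable reward defined purely on the final answer.
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def train_start(think, respond, data):
    """Return (stage, reward) records from the two training stages."""
    records = []
    # Stage 1: optimize the thinking policy; the responder only
    # supplies the final answer that defines the reward.
    for q, gold in data:
        thought = think(q)
        r = answer_reward(respond(q, thought), gold)
        records.append(("thinking", r))  # a real RL update goes here
    # Stage 2: with thinking fixed, optimize the response policy.
    for q, gold in data:
        thought = think(q)
        r = answer_reward(respond(q, thought), gold)
        records.append(("response", r))
    return records

# Toy stand-in policies so the loop is runnable end to end.
think = lambda q: f"Recall facts relevant to: {q}"
respond = lambda q, t: "Paris" if "France" in q else "unknown"

records = train_start(think, respond, [("Capital of France?", "Paris")])
```

The design point is that the thinking policy cannot collapse into a response-side shortcut: in stage 1 its only path to reward is producing a trace that helps the (frozen) responder answer correctly.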