Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
arXiv cs.CL / 4/27/2026
Key Points
- The paper introduces two evaluation metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), to test whether chain-of-thought reasoning learned via RLVR actually drives model answers and explains them (see the probe sketch after this list).
- Experiments with Qwen2.5 models on ReasoningGym tasks show that although RLVR improves task accuracy, it does not consistently increase CIR or SR: the learned reasoning chains are often neither causally important to the answer nor sufficient to justify it.
- When RLVR alone falls short, the authors find that a small amount of supervised fine-tuning (SFT) applied before RLVR can raise low CIR and SR.
- They also show that CIR and SR can be improved without SFT by adding auxiliary CIR/SR rewards alongside the outcome-based reward, reaching RLVR-level accuracy with more causally important and sufficient reasoning (see the reward-shaping sketch below).
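The digest does not spell out how CIR and SR are computed. A minimal sketch, assuming CIR is probed by ablating the reasoning chain and checking whether the answer flips, and SR by checking whether the answer can be recovered from the chain alone; all function names here are hypothetical, not the paper's API:

```python
from typing import Callable

# Illustrative interface only: a function mapping text in to an answer out.
# The paper's actual CIR/SR definitions may differ from this reading.
AnswerFn = Callable[[str], str]

def causal_importance(prompt: str, answer: str,
                      answer_without_chain: AnswerFn) -> float:
    """CIR-style probe: re-answer with the reasoning chain ablated.

    If the answer changes once the chain is removed, the chain was
    causally load-bearing; score 1.0, else 0.0.
    """
    ablated = answer_without_chain(prompt)
    return float(ablated != answer)

def sufficiency(reasoning: str, answer: str,
                answer_from_chain: AnswerFn) -> float:
    """SR-style probe: recover the answer from the reasoning alone.

    If a reader given only the chain reproduces the original answer,
    the chain is sufficient evidence for it; score 1.0, else 0.0.
    """
    recovered = answer_from_chain(reasoning)
    return float(recovered == answer)
```

Under this reading, RLVR can raise accuracy while both probes stay flat: the chain is emitted but neither used nor informative.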
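The last point describes reward shaping: the verifiable outcome reward is augmented with auxiliary terms for the two metrics. A minimal sketch of that combination, with illustrative weights (the paper's exact shaping and coefficients are not given in this digest):

```python
def shaped_reward(outcome: float, cir: float, sr: float,
                  lam_cir: float = 0.1, lam_sr: float = 0.1) -> float:
    """Outcome reward plus auxiliary CIR/SR terms.

    outcome is the usual verifiable 0/1 correctness signal; cir and sr
    come from probes like the ones above. lam_cir and lam_sr are
    illustrative hyperparameters, not values from the paper.
    """
    return outcome + lam_cir * cir + lam_sr * sr

# Example: a correct answer whose chain is causally important but not
# sufficient earns 1.0 + 0.1*1.0 + 0.1*0.0 = 1.1.
print(shaped_reward(outcome=1.0, cir=1.0, sr=0.0))  # 1.1
```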