When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective
arXiv cs.LG · March 18, 2026
Key Points
- The paper argues that unsupervised RL with intrinsic rewards can scale mathematical reasoning in LLMs by avoiding costly ground-truth annotations.
- It designs and evaluates intrinsic rewards that explicitly promote concise and certain generation to mitigate instability and reward hacking.
- It screens base models across a range of intrinsic reasoning capabilities to reveal how a model's foundational logical priors influence success or failure.
- It introduces a geometric diagnostic lens based on manifold envelopment to explain why some configurations stabilize while others collapse, and to predict when the unsupervised approach is likely to fail.
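The summary does not spell out the reward formula, but the idea of an intrinsic reward that favors certain, concise generations can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the reward is a weighted sum of (a) negative average per-token entropy, rewarding confident predictions, and (b) a brevity term penalizing answer length. The function name, weights, and `max_len` normalizer are all hypothetical.

```python
import math

def intrinsic_reward(token_probs, max_len=64, w_certainty=1.0, w_brevity=0.5):
    """Score a sampled answer without any ground-truth label.

    token_probs: one probability distribution (list of floats) per
    generated token. Higher reward = more confident and shorter.
    All weights and the max_len normalizer are illustrative choices.
    """
    # Average per-token Shannon entropy of the model's distributions.
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_probs
    ]
    certainty = -sum(entropies) / len(entropies)  # negative entropy: peaked = good
    brevity = -len(token_probs) / max_len         # shorter answers score higher
    return w_certainty * certainty + w_brevity * brevity

# A peaked (confident) short answer outranks a diffuse long one.
confident = [[0.97, 0.01, 0.01, 0.01]] * 8
diffuse = [[0.25, 0.25, 0.25, 0.25]] * 32
assert intrinsic_reward(confident) > intrinsic_reward(diffuse)
```

A reward like this is exactly the kind of signal the paper flags as hackable: a model can maximize it by emitting short, overconfident boilerplate, which is why the base model's logical priors and the geometric diagnostics matter.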