SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
arXiv cs.CL / 4/21/2026
Key Points
- The paper argues that RL fine-tuning for reasoning models often over-optimizes single-sample success (Pass@1) while limiting exploration of diverse reasoning paths needed for better multi-sample performance (Pass@k).
- It attributes this to a “probability mass squeezing” effect, where probability becomes overly concentrated on a small set of high-reward trajectories, reducing genuine trajectory diversity.
- To counter this effect, the authors propose Steering Probability Squeezing (SPS), which alternates standard RL with inverse reinforcement learning (IRL), treating on-policy rollouts as demonstrations in order to reshape the trajectory distribution.
- Experiments on five reasoning benchmarks show SPS improves exploration and yields higher Pass@k, and the work also analyzes RL learning dynamics to estimate an empirical upper bound on achievable Pass@k.
- Overall, the results suggest that alternating RL and IRL can extend the intrinsic exploration capability of RL-trained large language model reasoning systems without relying on external supervision.
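The Pass@k metric central to the points above is conventionally estimated with the unbiased combinatorial estimator 1 − C(n−c, k)/C(n, k), where n samples are drawn and c of them are correct. A minimal sketch of that standard estimator (illustrative only; the paper may compute Pass@k differently):

```python
from math import prod

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k: the probability that at least one
    of k samples drawn (without replacement) from n attempts, of which
    c are correct, is correct. Computed as 1 - C(n-c, k) / C(n, k),
    using a numerically stable product form."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must include a correct one.
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 5 correct out of 10 rollouts gives Pass@1 = 0.5,
# while Pass@4 is considerably higher -- the gap SPS aims to widen
# by preserving diverse reasoning paths.
print(pass_at_k(10, 5, 1))
print(pass_at_k(10, 5, 4))
```

The stable product form avoids the overflow that direct binomial coefficients can cause for large n, which is why it is the common choice in evaluation code.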