SSPO: Subsentence-level Policy Optimization
arXiv cs.CL / 4/13/2026
Key Points
- The paper identifies stability problems in existing RLVR post-training methods: GRPO can collapse due to token-level importance ratios that overemphasize outliers, while GSPO can remain unstable when response-level clipping effectively retains entire high-variance responses.
- It proposes SSPO (Subsentence-level Policy Optimization), which computes importance ratios at the subsentence level, reducing variance while avoiding the clipping failure modes of GRPO and GSPO (a minimal sketch of the ratio computation follows this list).
- SSPO further refines PPO-CLIP by using subsentence-level entropy to adapt the clipping bounds: tighter for low-entropy tokens, wider for high-entropy regions to permit more exploration (see the second sketch below).
- Experiments on Qwen2.5-1.5B-Math show SSPO achieves an average score of 46.72 across five datasets, beating GRPO (43.01) and GSPO (44.42), with state-of-the-art results on four of the five datasets.
- On Qwen2.5-7B-Math, SSPO again achieves the highest average score, outperforming five baseline methods and supporting the claim that it improves RLVR effectiveness for math reasoning.
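To make the granularity concrete, here is a minimal PyTorch sketch of subsentence-level importance ratios. The function name, tensor shapes, and the punctuation-based segmentation are illustrative assumptions; the paper's exact formulation is not reproduced here.

```python
import torch

def subsentence_ratios(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       boundaries: list[tuple[int, int]]) -> torch.Tensor:
    """Per-token importance ratios computed at the subsentence level.

    Every token inside a subsentence shares that span's length-normalized
    ratio (the geometric mean of its token-level ratios), a granularity
    between GRPO's token level and GSPO's response level.

    logp_new, logp_old: shape (T,) log-probs of the sampled tokens under
        the current and behavior policies.
    boundaries: (start, end) index pairs partitioning the response into
        subsentences, e.g. from splitting at punctuation (the paper's
        exact segmentation rule is an assumption here).
    """
    ratios = torch.empty_like(logp_new)
    for start, end in boundaries:
        # Mean log-ratio over the span equals the log of the geometric
        # mean of the per-token ratios, so one exp() yields the span ratio.
        span_log_ratio = (logp_new[start:end] - logp_old[start:end]).mean()
        ratios[start:end] = span_log_ratio.exp()
    return ratios
```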
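And a sketch of the entropy-adapted PPO-CLIP objective from the third point. The `eps_lo`/`eps_hi` bounds and the min-max entropy normalization are hypothetical placeholders; only the direction of the adaptation (tighter clipping at low entropy, looser at high entropy) comes from the summary above.

```python
def entropy_adaptive_ppo_loss(ratios: torch.Tensor,
                              advantages: torch.Tensor,
                              entropy: torch.Tensor,
                              eps_lo: float = 0.1,
                              eps_hi: float = 0.3) -> torch.Tensor:
    """PPO-CLIP surrogate whose clip width grows with entropy.

    Low-entropy (near-deterministic) tokens get a tight clip range;
    high-entropy tokens get a wider one, leaving room for exploration.
    The linear schedule below is an assumed mapping, not the paper's.
    """
    # Normalize entropy into [0, 1] within the batch (assumed scheme).
    h = (entropy - entropy.min()) / (entropy.max() - entropy.min() + 1e-8)
    eps = eps_lo + (eps_hi - eps_lo) * h          # per-token clip width
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps)
    # Standard pessimistic PPO objective, taken per token then averaged.
    surrogate = torch.minimum(ratios * advantages, clipped * advantages)
    return -surrogate.mean()
```

In this reading, the `ratios` from the first sketch would feed directly into this loss alongside group-normalized advantages, combining the two mechanisms the key points describe.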