SSPO: Subsentence-level Policy Optimization

arXiv cs.CL / 4/13/2026


Key Points

  • The paper identifies stability problems in existing RLVR post-training methods: GRPO can collapse due to token-level importance ratios that overemphasize outliers, while GSPO can remain unstable when response-level clipping effectively retains entire high-variance responses.
  • It proposes SSPO (Subsentence-level Policy Optimization), which computes importance ratios at the subsentence level to balance variance reduction and prevent GRPO/GSPO clipping failure modes.
  • SSPO further improves PPO-CLIP by adding subsentence-level entropy to adapt clipping bounds, tightening them for low-entropy tokens while allowing more exploration for high-entropy regions.
  • Experiments on Qwen2.5-1.5B-Math show SSPO achieves an average score of 46.72 across five datasets, beating GRPO (43.01) and GSPO (44.42), with state-of-the-art results on four datasets.
  • On Qwen2.5-7B-Math, SSPO again achieves the highest average score, outperforming five baseline methods and supporting the claim that it improves RLVR effectiveness for math reasoning.
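The granularity trade-off in the first two points can be made concrete with a toy numpy sketch. The paper's exact segmentation and aggregation rules are not specified here, so the length-normalized per-span geometric mean and the `(start, end)` segment format below are assumptions; the example only illustrates how an outlier token dominates a token-level ratio (GRPO), gets diluted at the response level (GSPO), and stays localized at the subsentence level (SSPO).

```python
import numpy as np

def importance_ratios(logp_new, logp_old, segments=None):
    """Importance ratios at three granularities (illustrative sketch).

    logp_new, logp_old: 1-D arrays of per-token log-probs for one response.
    segments: list of (start, end) index pairs marking subsentence spans
              (e.g. splits at punctuation); needed for the subsentence case.
    """
    diff = logp_new - logp_old
    token_level = np.exp(diff)            # GRPO-style: one ratio per token
    sequence_level = np.exp(diff.mean())  # GSPO-style: one length-normalized ratio per response
    subsentence_level = None
    if segments is not None:
        # SSPO-style (assumed form): one length-normalized ratio per span,
        # broadcast back to the tokens that span covers.
        subsentence_level = np.empty_like(diff)
        for start, end in segments:
            subsentence_level[start:end] = np.exp(diff[start:end].mean())
    return token_level, sequence_level, subsentence_level

# Toy response of 6 tokens with one outlier at position 2.
logp_old = np.array([-1.0, -1.2, -0.8, -1.1, -0.9, -1.0])
logp_new = np.array([-1.0, -1.1, -3.0, -1.0, -0.9, -1.0])
tok, seq, sub = importance_ratios(logp_new, logp_old, segments=[(0, 3), (3, 6)])
```

On this toy input the outlier's token-level ratio is the most extreme, the response-level ratio is the most diluted, and its subsentence-level ratio sits in between, which is exactly the balance the key points attribute to SSPO.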

Abstract

As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues: GRPO (Group Relative Policy Optimization) often suffers from unstable policy updates, while GSPO (Group Sequence Policy Optimization) can retain high-variance tokens. In GRPO, the importance ratio is computed at the token level, which overemphasizes individual tokens and makes learning sensitive to outliers, potentially causing training collapse. GSPO instead computes a response-level importance ratio, mitigating variance and reducing the accumulation of token-level noise present in GRPO. Nevertheless, our experiments show that GSPO frequently yields a near-zero clipping fraction: extreme token-level ratios can be diluted by other tokens in the same response, causing the entire response to be retained and resulting in unstable updates. We propose SSPO, which computes importance ratios at the subsentence level, striking a balance between GRPO and GSPO. SSPO alleviates training collapse and excessive variance while avoiding the failure mode in which the clipping mechanism indiscriminately retains entire responses. Moreover, we incorporate subsentence-level entropy into PPO-CLIP to adaptively adjust the clipping bounds: we encourage exploration for high-entropy tokens while tightening the clipping range for low-entropy tokens. Empirically, SSPO achieves an average score of 46.72 across five datasets on the Qwen2.5-1.5B-Math model, outperforming GRPO (43.01) and GSPO (44.42), and attains state-of-the-art results on four datasets. On the Qwen2.5-7B-Math model, SSPO also achieves the highest average score, outperforming five baseline methods. These results demonstrate SSPO's effectiveness in RLVR.
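The entropy-adaptive clipping idea from the abstract can be sketched in a few lines. The paper does not specify the mapping from entropy to clipping bounds, so the linear interpolation and the `eps_min`/`eps_max` parameters below are assumptions; the sketch only shows the qualitative behavior described: low-entropy segments get tighter PPO-CLIP bounds, high-entropy segments get looser ones.

```python
import numpy as np

def adaptive_clip_bounds(entropy, eps_min=0.1, eps_max=0.3):
    """Map (sub)sentence-level entropies to PPO-CLIP bounds (assumed linear form).

    Entropies are normalized to [0, 1] against their observed range, then used
    to interpolate the clipping half-width: low entropy -> tight bounds,
    high entropy -> loose bounds that permit more exploration.
    """
    h = (entropy - entropy.min()) / (entropy.max() - entropy.min() + 1e-8)
    eps = eps_min + h * (eps_max - eps_min)
    return 1.0 - eps, 1.0 + eps

def clipped_surrogate(ratio, advantage, lo, hi):
    """Standard PPO-CLIP surrogate, here with per-segment bounds."""
    return np.minimum(ratio * advantage, np.clip(ratio, lo, hi) * advantage)

# Three subsentence segments with increasing entropy get increasingly
# permissive clipping ranges.
entropy = np.array([0.2, 0.6, 1.0])
lo, hi = adaptive_clip_bounds(entropy)
```

With a fixed half-width of 0.2 (a common PPO default), a ratio of 1.5 on a low-entropy segment would still pass through with value 1.2; here the tighter bound caps its contribution at 1.1, while high-entropy segments keep more room to move.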