SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

arXiv cs.AI · April 13, 2026


Key Points

  • The paper addresses why standard token-level PPO is unstable for long-horizon chain-of-thought (CoT) reasoning, where temporal credit assignment becomes difficult and memory costs for the value model can be prohibitive.
  • It proposes Sequence-Level PPO (SPPO), which reframes reasoning as a Sequence-Level Contextual Bandit and uses a decoupled scalar value function to compute low-variance advantages.
  • SPPO is designed to retain PPO’s sample efficiency while improving update stability, avoiding the multi-sampling and baseline-estimation overhead common in critic-free alternatives like GRPO.
  • Experiments on mathematical benchmarks show SPPO outperforms standard PPO and reaches performance comparable to more computation-heavy group-based methods, but with better resource efficiency.
  • Overall, SPPO offers a scalable training approach for aligning reasoning LLMs with verifiable rewards, particularly in long-horizon settings.
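The contrast in the points above — a single learned scalar baseline versus GRPO's group-of-samples baseline — can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; the function names and the normalization constant are assumptions.

```python
def sequence_advantage(reward, value_estimate):
    """Critic-based, sequence-level advantage: the verifiable outcome
    reward minus a learned scalar baseline V(prompt). Only one rollout
    per prompt is needed, which is the throughput win SPPO targets."""
    return reward - value_estimate

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each sample's reward minus the group mean,
    normalized by the group std. Requires G rollouts of the same prompt,
    which is the multi-sampling overhead SPPO avoids."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One rollout suffices for the critic path:
adv = sequence_advantage(reward=1.0, value_estimate=0.6)

# The group path needs several rollouts per prompt before any update:
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Note that because the value function here is a scalar per prompt rather than a per-token estimate, it sidesteps both the long-horizon credit-assignment instability and the memory cost of a full token-level value model.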

Abstract

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
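Treating the whole CoT sequence as a single bandit action means the PPO importance ratio and clipping operate once per sequence rather than per token. The sketch below shows one clipped-surrogate term under that reading; it is a minimal illustration assuming sum-of-log-prob ratios and a per-sequence scalar advantage, not the paper's exact objective.

```python
import math

def sequence_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """One clipped-surrogate term at the sequence level.

    logp_new / logp_old: per-token log-probabilities of the sampled CoT
    under the current and behavior policies. Summing them gives the
    log-ratio of the full sequence, treated as a single bandit action.
    advantage: a scalar per sequence, e.g. reward minus a learned
    scalar baseline V(prompt).
    """
    ratio = math.exp(sum(logp_new) - sum(logp_old))
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Standard PPO pessimistic bound, applied once per sequence.
    return min(ratio * advantage, clipped * advantage)

# If the policy has not moved, the ratio is 1 and the term is just the
# advantage; large policy shifts are clipped to the [1-eps, 1+eps] band.
term = sequence_clipped_objective([-1.0, -2.0], [-1.0, -2.0], 0.5)
```

Clipping a single sequence-level ratio, rather than thousands of token-level ratios along a long CoT, is what gives the outcome-based update its stability.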