Skip-Connected Policy Optimization for Implicit Advantage

arXiv cs.LG / 4/13/2026


Key Points

  • The paper finds that while dense, token-level rewards could in principle improve performance in reinforcement learning with verifiable rewards (RLVR), Monte Carlo estimation under practical sampling budgets produces high-variance, sign-inconsistent advantages for early reasoning tokens, so outcome-only GRPO paradoxically outperforms dense-reward training in practice.
  • It introduces Skip-Connected Optimization (SKPO), which splits reasoning into upstream and downstream phases and uses downstream Monte Carlo sampling to supply dense rewards to the upstream phase, which is trained with single-stream optimization.
  • For the downstream phase, SKPO retains group-relative policy optimization and adds a skip connection that concatenates the upstream segment with the original problem, letting the model build on sound upstream reasoning while bypassing flawed steps through direct access to the problem.
  • Experiments report relative gains of 3.91% on Qwen2.5-Math-7B and 6.17% on Llama-3.2-3B over the strongest baselines across math and out-of-domain reasoning/code benchmarks.
  • The authors attribute benefits to an “implicit advantage,” where SKPO produces higher-quality intermediate steps even when final correctness is matched.
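The group-relative optimization that SKPO retains for the downstream phase follows the standard GRPO recipe: each sampled completion is scored against the mean reward of its own sampling group, normalized by the group's standard deviation. A minimal sketch (function name and the small epsilon are illustrative, not from the paper):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: score each sampled completion relative to
    # its group's mean reward, normalized by the group's standard
    # deviation (eps avoids division by zero for uniform groups).
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Outcome-only rewards for a group of 4 sampled solutions (1.0 = verified correct).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are centered within the group, correct samples are pushed up and incorrect ones pushed down without any learned value model.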

Abstract

Group Relative Policy Optimization (GRPO) has proven effective in RLVR by using outcome-based rewards. While fine-grained dense rewards can theoretically improve performance, we reveal that under practical sampling budgets, Monte Carlo estimation yields high-variance and sign-inconsistent advantages for early reasoning tokens, paradoxically underperforming outcome-only GRPO. We propose Skip-Connected Optimization (SKPO), which decomposes reasoning into upstream and downstream phases: upstream receives dense rewards from downstream Monte Carlo sampling with single-stream optimization; downstream maintains group-relative optimization, where a skip connection concatenates the upstream segment with the original problem, enabling the model to leverage helpful upstream reasoning while preserving the freedom to bypass flawed reasoning through direct problem access. Experiments demonstrate improvements of 3.91% and 6.17% relative gains over the strongest baselines on Qwen2.5-Math-7B and Llama-3.2-3B respectively across mathematical benchmarks and out-of-domain tasks including general reasoning and code generation. Further analysis reveals an implicit advantage: SKPO generates trajectories with higher intermediate-step quality even when matched for final correctness.
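The upstream dense reward described above can be read as a plain Monte Carlo value estimate: the reward of an upstream prefix is the fraction of downstream rollouts continued from it that verify correct. A toy sketch of why small sampling budgets make this noisy (all names here are hypothetical; the paper's exact rollout scheme is not reproduced):

```python
import random
from statistics import fmean

def mc_prefix_reward(rollout, prefix, n_rollouts=4):
    # Monte Carlo estimate of an upstream prefix's value: the mean
    # verifiable outcome (1.0 correct / 0.0 incorrect) over a small
    # budget of downstream rollouts continued from that prefix.
    return fmean(rollout(prefix) for _ in range(n_rollouts))

# Toy rollout: suppose the true success probability from this prefix is 0.5.
rng = random.Random(0)
noisy_rollout = lambda prefix: float(rng.random() < 0.5)
estimates = [mc_prefix_reward(noisy_rollout, "partial reasoning...") for _ in range(6)]
# With only 4 rollouts per estimate, repeated estimates scatter around
# the true value 0.5 in coarse steps of 0.25 — the high-variance,
# sign-inconsistent regime the abstract identifies for early tokens.
```

Each estimate can only take values in {0, 0.25, 0.5, 0.75, 1.0}, so the sign of the resulting advantage for the same prefix can flip between batches, which is the failure mode SKPO's phase decomposition is designed to avoid.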