From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

arXiv cs.LG / April 16, 2026


Key Points

  • The paper argues that RL with verifiable rewards (RLVR) improves LLM reasoning by optimizing P(y|x), but is limited by the base model’s existing output distribution P(y).
  • It introduces PreRL (Pre-train Space RL), which performs reward-driven online updates directly on the marginal distribution P(y) in the pre-train space, mitigating the distribution shift that passive learning from static corpora induces.
  • The authors theoretically and empirically validate strong gradient alignment between log P(y) and log P(y|x), positioning PreRL as a practical surrogate for standard RL.
  • A key mechanism, Negative Sample Reinforcement (NSR), is shown to prune incorrect reasoning regions while promoting reflective behavior, boosting transition and reflection thoughts by 14.89x and 6.54x, respectively.
  • Building on this, the paper proposes Dual Space RL (DSRL), which uses NSR-PreRL to expand the reasoning horizon before switching to standard RL for fine-grained optimization, consistently outperforming strong baselines.
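The gradient-alignment claim has a simple intuition: under a prior P(x) over prompts, log P(y) = log Σₓ P(x) P(y|x), whose gradient is a posterior-weighted average of the conditional gradients ∇ log P(y|x). The toy numerical check below (our illustration, not the paper's code) verifies this identity for a small softmax model with a uniform prompt prior, and shows that each conditional gradient has positive cosine similarity with the marginal gradient:

```python
import numpy as np

# Toy model (not from the paper): P(y|x) = softmax(W[x])[y], uniform P(x).
# Identity to check: d/dW log P(y) = sum_x P(x|y) * d/dW log P(y|x),
# i.e. the marginal gradient is a posterior-weighted average of
# conditional gradients -- one intuition for gradient alignment.
rng = np.random.default_rng(0)
n_x, n_y = 4, 5
W = rng.normal(size=(n_x, n_y))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

y = 2
P_y_given_x = np.array([softmax(W[x])[y] for x in range(n_x)])
P_x = np.full(n_x, 1.0 / n_x)          # uniform prior over prompts
P_y = (P_y_given_x * P_x).sum()        # marginal P(y)
P_x_given_y = P_y_given_x * P_x / P_y  # posterior over prompts given y

# Analytic conditional gradients:
#   d log softmax(W[x])_y / dW[x, k] = 1{k = y} - softmax(W[x])_k
grad_cond = np.zeros((n_x, n_x, n_y))  # grad_cond[x] = d log P(y|x) / dW
for x in range(n_x):
    grad_cond[x, x] = (np.arange(n_y) == y) - softmax(W[x])

grad_marginal = (P_x_given_y[:, None, None] * grad_cond).sum(axis=0)

# Finite-difference check of the marginal gradient
eps = 1e-6
fd = np.zeros_like(W)
for i in range(n_x):
    for j in range(n_y):
        Wp = W.copy()
        Wp[i, j] += eps
        P_y_p = np.mean([softmax(Wp[x])[y] for x in range(n_x)])
        fd[i, j] = (np.log(P_y_p) - np.log(P_y)) / eps

print(np.abs(grad_marginal - fd).max() < 1e-4)  # True: identity holds

def cos(a, b):
    return (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b))

# Each conditional gradient points "the same way" as the marginal one
print([round(cos(grad_marginal, grad_cond[x]), 3) for x in range(n_x)])
```

This is only a finite toy; the paper's claim concerns autoregressive LLMs, where x and y are sequences, but the decomposition above is the same structural reason one would expect ∇ log P(y) to serve as a surrogate for ∇ log P(y|x).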

Abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
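To make the NSR mechanism concrete, here is a minimal sketch (our hedged reconstruction, not the authors' code) of what "pruning incorrect reasoning spaces" means at its simplest: a categorical policy over candidate outputs, updated only on negative samples via an unlikelihood-style gradient that pushes probability mass away from outputs a (hypothetical) verifier rejects, while leaving positive samples untouched:

```python
import numpy as np

# Hedged sketch of Negative Sample Reinforcement on a toy categorical
# "policy" over 6 candidate outputs. A hypothetical verifier marks
# outputs {0, 1} as correct; sampled incorrect outputs get a pure
# unlikelihood update (gradient ascent on -log pi(y_wrong)), so the
# policy prunes the incorrect region rather than sharpening the correct one.
rng = np.random.default_rng(1)
logits = np.zeros(6)   # uniform policy at initialization
correct = {0, 1}

def pi(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.5
for _ in range(200):
    y = rng.choice(6, p=pi(logits))
    if y not in correct:  # negative sample: reward 0
        # grad of -log pi(y) w.r.t. logits is pi - onehot(y)
        onehot = (np.arange(6) == y).astype(float)
        logits += lr * (pi(logits) - onehot)
    # positive samples are deliberately untouched in this NSR-only sketch

p = pi(logits)
print(round(p[0] + p[1], 2))  # mass concentrates on the correct subspace
```

In this toy, pruning alone is enough to steer nearly all probability mass into the correct subspace, which mirrors the paper's framing of DSRL: use NSR-PreRL to carve away incorrect regions first, then hand the narrowed policy to standard RL for fine-grained optimization.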