From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
arXiv cs.LG / 4/16/2026
Key Points
- The paper argues that RL with verifiable rewards (RLVR) improves LLM reasoning by optimizing P(y|x), but is limited by the base model’s existing output distribution P(y).
- It introduces PreRL (Pre-train Space RL), which performs reward-driven online updates directly on the marginal distribution P(y), i.e., in the pre-train space, to mitigate the distribution shift that comes from training only on a static corpus (a minimal code sketch of this idea follows the list).
- The authors argue, both theoretically and empirically, that the gradients of log P(y) and log P(y|x) are strongly aligned, positioning PreRL as a practical surrogate for standard RL (the two objectives are contrasted after the list).
- A key mechanism, Negative Sample Reinforcement (NSR), is shown to prune incorrect reasoning regions while promoting reflective behavior, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively.
- Building on this, the paper proposes Dual Space RL (DSRL), which uses NSR-PreRL to expand the reasoning horizon before switching to standard RL for fine-grained optimization, with reported results outperforming strong baselines.
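As a rough reference for the surrogate claim above, the two objectives can be written as REINFORCE-style gradient estimators; this is a plain restatement of the key points, not the paper's formal derivation. Standard RLVR updates the conditional policy given a prompt $x$, while PreRL applies the same reward-weighted update to the unconditional (pre-train space) likelihood:

$$\nabla_\theta J_{\mathrm{RLVR}} = \mathbb{E}_{y \sim P_\theta(\cdot \mid x)}\!\left[ R(x, y)\, \nabla_\theta \log P_\theta(y \mid x) \right], \qquad \nabla_\theta J_{\mathrm{PreRL}} = \mathbb{E}_{y \sim P_\theta}\!\left[ R(y)\, \nabla_\theta \log P_\theta(y) \right]$$

The gradient-alignment result summarized above amounts to the claim that these two update directions correlate strongly in practice.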
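To make the pre-train-space update concrete, here is a minimal, self-contained PyTorch sketch of an NSR-style reward-weighted update on the marginal log-likelihood $\log P_\theta(y)$: sequences are sampled with no prompt, a stand-in verifier assigns ±1 rewards, and negative rewards push probability mass away from incorrect samples. The `TinyLM` model, `toy_verifier`, and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of an NSR-style PreRL update: reward-weighted REINFORCE on the
# *unconditional* sequence log-likelihood log P(y). Toy model and verifier only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MAX_LEN, BOS = 32, 64, 16, 0

class TinyLM(nn.Module):
    """Minimal autoregressive LM standing in for a pretrained base model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # next-token logits

def sample_sequences(model, batch_size):
    """Draw y ~ P_theta(y): generation starts from a BOS token, with no prompt x."""
    tokens = torch.full((batch_size, 1), BOS, dtype=torch.long)
    for _ in range(MAX_LEN):
        logits = model(tokens)[:, -1]
        nxt = torch.multinomial(F.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

def sequence_logprob(model, tokens):
    """log P_theta(y) = sum_t log P_theta(y_t | y_<t)."""
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1).sum(dim=-1)

def toy_verifier(tokens):
    """Stand-in verifiable reward: +1 if the token sum is even, -1 otherwise."""
    return (tokens.sum(dim=-1) % 2 == 0).float() * 2.0 - 1.0

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(50):
    with torch.no_grad():
        ys = sample_sequences(model, batch_size=64)
    rewards = toy_verifier(ys)          # negative rewards mark incorrect samples
    logp = sequence_logprob(model, ys)  # marginal log-likelihood, no prompt involved
    loss = -(rewards * logp).mean()     # REINFORCE; negatives prune bad regions of P(y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the DSRL recipe described above, an update of this flavor would only be the first stage, followed by standard prompt-conditioned RL on P(y|x) for fine-grained optimization.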