When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO
arXiv cs.AI / 3/16/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The paper reexamines Group Relative Policy Optimization (GRPO), noting that it treats each sampled output as an independent sample and so misses the contrast between correct and incorrect solutions within the same group.
- It introduces Bilateral Context Conditioning (BICC), enabling cross-reference of successful and failed reasoning traces during optimization without additional sampling or auxiliary models.
- It adds Reward-Confidence Correction (RCC) to stabilize training, dynamically adjusting the advantage baseline using the reward-confidence covariance derived from a first-order variance-minimizing estimator (a minimal sketch follows this list).
- The proposed methods yield a contrastive reformulation of GRPO with empirical improvements on mathematical reasoning benchmarks across multiple models and algorithms, and the code is released on GitHub.
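
To ground the two mechanisms, the sketch below first computes the standard GRPO group-relative advantage (each reward centered on the group mean and scaled by the group standard deviation), then applies a confidence-based baseline shift in the spirit of RCC. The confidence signal (mean token log-probability), the function names, and the coefficient beta = Cov(r, c) / Var(c), the classic first-order variance-minimizing choice for a linear control variate, are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Vanilla GRPO advantage: each output's reward, centered on the
    group mean and scaled by the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def rcc_advantages(rewards: np.ndarray, confidences: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of a Reward-Confidence Correction.

    Treats a per-sample confidence signal c (e.g., the mean token
    log-probability of the sampled solution) as a linear control
    variate. beta = Cov(r, c) / Var(c) is the first-order
    variance-minimizing coefficient for such a control variate;
    whether RCC uses exactly this form is an assumption here.
    """
    r = np.asarray(rewards, dtype=float)
    c = np.asarray(confidences, dtype=float)
    # Covariance-based baseline shift; the epsilon guards degenerate groups.
    beta = np.cov(r, c, ddof=0)[0, 1] / (c.var() + 1e-8)
    corrected = r - beta * (c - c.mean())
    return grpo_advantages(corrected)

# Example: a group of G = 8 sampled solutions with binary correctness
# rewards (1 = correct, 0 = incorrect) and made-up confidence scores.
rewards = np.array([1, 0, 1, 1, 0, 0, 1, 0])
confidences = np.array([-0.4, -1.3, -0.5, -0.6, -0.9, -1.1, -0.3, -1.0])
print(grpo_advantages(rewards))
print(rcc_advantages(rewards, confidences))
```

In this toy group, high-confidence failures and low-confidence successes pull the corrected baseline in opposite directions, which is the kind of stabilizing effect a reward-confidence covariance term is meant to capture.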
