CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
arXiv cs.AI / 3/12/2026
Key Points
- Standard RLVR rewards only the final outcome of a rollout, which can leave process-level errors and hallucinations in the model's intermediate reasoning uncorrected.
- CLIPO adds a contrastive learning objective over successful rollouts, pulling correct reasoning paths toward a shared, invariant structure and thereby providing stronger cross-trajectory regularization than single-path supervision.
- This approach mitigates step-level reasoning inconsistencies and reduces hallucinations, improving generalization and robustness in LLM policy optimization.
- Experiments show that CLIPO consistently improves RLVR baselines across diverse reasoning benchmarks, and the authors provide code and training recipes on GitHub.
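To make the cross-trajectory idea concrete, here is a minimal sketch of an InfoNCE-style contrastive loss over rollout embeddings. This is an illustration, not the paper's actual objective: the encoder producing the embeddings, the `temperature` value, and the labeling scheme (correct rollouts for the same prompt treated as positives, everything else as negatives) are all assumptions for the example.

```python
import numpy as np

def contrastive_rollout_loss(emb, labels, temperature=0.1):
    """InfoNCE-style contrastive loss over rollout embeddings (illustrative).

    emb:    (N, D) array of trajectory embeddings from a hypothetical encoder.
    labels: (N,) array; successful rollouts for the same prompt share a label,
            so same-label pairs are positives and all other pairs negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / temperature            # (N, N) pairwise similarities
    np.fill_diagonal(sim, -np.inf)             # exclude trivial self-pairs

    # Numerically stable row-wise log-softmax
    row_max = np.max(sim, axis=1, keepdims=True)
    logp = sim - row_max - np.log(
        np.sum(np.exp(sim - row_max), axis=1, keepdims=True)
    )

    # Positive mask: same label, excluding the diagonal
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)

    # Average negative log-likelihood over all positive pairs
    return -logp[pos].mean()
```

With embeddings where same-label rollouts align, the loss is near zero; if positives are forced to be dissimilar, it grows, which is the pressure that pulls correct reasoning paths toward a common structure in this toy setup.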
