Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization
arXiv cs.LG · April 17, 2026
Key Points
- The paper proposes CW-GRPO (Contribution-Weighted Group Relative Policy Optimization) to improve reinforcement learning for LLM-based search agents by better handling credit assignment across a search trajectory.
- Instead of relying on unstable process rewards or sparse trajectory-level outcome rewards, CW-GRPO uses an LLM judge to score retrieval utility and reasoning correctness at each search round.
- These per-round contribution scores are used to rescale outcome-based advantages, enabling finer-grained credit assignment while maintaining training stability.
- Experiments on multiple knowledge-intensive benchmarks show CW-GRPO outperforms standard GRPO by 5.0% on Qwen3-8B and 6.3% on Qwen3-1.7B, indicating more effective search behaviors.
- The analysis suggests that successful trajectories tend to concentrate high contribution scores in a few pivotal rounds, offering empirical insight into what makes search agents succeed.
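
The core idea in the bullets above, scoring each search round and using those scores to redistribute a trajectory-level GRPO advantage, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the normalization by each round's share of total contribution, and the rescaling that preserves the trajectory's mean advantage are all assumptions made here for clarity.

```python
import statistics

def grpo_advantages(rewards):
    """Standard GRPO-style advantage: normalize each rollout's outcome
    reward against the group of rollouts for the same query."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

def contribution_weighted_advantages(rewards, contributions):
    """CW-GRPO sketch (assumed form): rescale a trajectory's outcome
    advantage per search round by that round's share of the total
    contribution score, e.g. an LLM judge's rating of retrieval utility
    and reasoning correctness for the round."""
    advs = grpo_advantages(rewards)
    weighted = []
    for adv, round_scores in zip(advs, contributions):
        total = sum(round_scores) or 1.0
        # Multiplying by the number of rounds keeps the mean per-round
        # advantage equal to the original trajectory advantage, so the
        # weights redistribute credit rather than change its scale.
        n = len(round_scores)
        weighted.append([adv * (s / total) * n for s in round_scores])
    return weighted
```

For example, with two rollouts (reward 1.0 and 0.0) and per-round judge scores `[0.8, 0.2]` for the first, the successful trajectory's advantage concentrates on round 1, rather than being spread uniformly as in vanilla GRPO.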


