ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training
arXiv cs.AI / 4/1/2026
Key Points
- The paper introduces ShapE-GRPO, a Shapley-value–enhanced variant of Group Relative Policy Optimization designed for multi-candidate LLM training where the goal is to maximize set-level utility rather than individual-candidate quality.
- It argues that existing GRPO-style methods give identical scalar rewards to all candidates, causing noisy gradients and allowing weaker candidates to “free-ride” on strong peers’ rewards.
- ShapE-GRPO decomposes the set-level reward into candidate-specific credit using a cooperative-game-theory formulation that preserves the Shapley axioms (efficiency, symmetry, null player, additivity) while remaining computable in polynomial time.
- Experiments on multiple datasets show consistent improvements over standard GRPO, including faster convergence during training.
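The core idea in the bullets above — replacing one shared scalar reward with per-candidate Shapley credit under a set-level utility — can be sketched generically. The paper's polynomial-time algorithm is not specified here, so this sketch uses standard Monte Carlo permutation sampling as a stand-in estimator; `set_utility` and the toy deduplication utility are illustrative assumptions, not the paper's reward function.

```python
import random


def shapley_rewards(candidates, set_utility, num_samples=200, seed=0):
    """Approximate each candidate's Shapley value under a set-level utility.

    Uses Monte Carlo permutation sampling: for each random ordering,
    a candidate is credited with its marginal contribution to the
    utility of the coalition built so far. Averaging over orderings
    estimates the Shapley value while preserving efficiency exactly
    (per-sample marginal contributions always sum to the full-set utility).
    """
    rng = random.Random(seed)
    n = len(candidates)
    values = [0.0] * n
    for _ in range(num_samples):
        order = list(range(n))
        rng.shuffle(order)
        coalition = []
        prev_u = set_utility(coalition)
        for idx in order:
            coalition.append(candidates[idx])
            u = set_utility(coalition)
            values[idx] += u - prev_u  # marginal contribution of this candidate
            prev_u = u
    return [v / num_samples for v in values]


# Toy set-level utility (assumed for illustration): number of distinct
# answers covered, so duplicate candidates add no value.
vals = shapley_rewards(["a", "b", "b"], lambda c: len(set(c)), num_samples=50)
```

With this utility, the unique candidate `"a"` earns a full unit of credit, while the two duplicate `"b"` candidates split one unit between them — exactly the "no free-riding" behavior the paper motivates, in contrast to GRPO handing all three the same scalar.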