Personalized Group Relative Policy Optimization for Heterogeneous Preference Alignment
arXiv cs.AI / 3/12/2026
Key Points
- The paper argues that standard Group Relative Policy Optimization (GRPO) assumes samples within a group are exchangeable; when training batches mix heterogeneous preference groups, this assumption breaks down and biases learning toward the dominant preferences in LLM alignment.
- It introduces Personalized GRPO (P-GRPO), which decouples advantage estimation from immediate batch statistics by normalizing advantages against preference-group-specific reward histories (see the sketch after this list).
- Across diverse tasks, P-GRPO demonstrates faster convergence and higher rewards than standard GRPO, improving alignment with heterogeneous user preferences without sacrificing general capabilities.
- The work highlights the importance of accounting for reward heterogeneity at the optimization level for building models that faithfully align with diverse human preferences.
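To make the second point concrete, here is a minimal Python sketch of the idea as described: standard GRPO z-scores rewards within the current batch, while a P-GRPO-style estimator normalizes each sample against running statistics for its own preference group. The class name `GroupRewardTracker`, the exponential-moving-average update, and the `decay`/`eps` parameters are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from collections import defaultdict


class GroupRewardTracker:
    """Running mean/variance of rewards per preference group.

    Illustrative only: an exponential moving average (EMA) is assumed here;
    the paper's exact history-tracking rule may differ.
    """

    def __init__(self, decay: float = 0.99, eps: float = 1e-8):
        self.decay = decay
        self.eps = eps
        self.mean = defaultdict(float)       # per-group running mean
        self.var = defaultdict(lambda: 1.0)  # per-group running variance

    def update(self, group: str, rewards: np.ndarray) -> None:
        # Blend the current batch statistics into the group's history.
        d = self.decay
        self.mean[group] = d * self.mean[group] + (1 - d) * rewards.mean()
        self.var[group] = d * self.var[group] + (1 - d) * rewards.var()

    def advantage(self, group: str, rewards: np.ndarray) -> np.ndarray:
        # P-GRPO-style: normalize against the group's own reward history,
        # decoupling the advantage from the composition of the current batch.
        return (rewards - self.mean[group]) / np.sqrt(self.var[group] + self.eps)


def grpo_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Standard GRPO: z-score rewards within the current sampled batch.
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# A mixed batch dominated by one preference group skews batch statistics:
r_major = np.array([0.90, 0.85, 0.95, 0.88])  # dominant group's reward scale
r_minor = np.array([0.30, 0.40])              # minority group's reward scale
mixed_adv = grpo_advantage(np.concatenate([r_major, r_minor]))
# -> minority samples get large negative advantages even when they are good
#    *for their group*, pushing the policy toward the dominant preferences.

tracker = GroupRewardTracker()
tracker.update("minority", r_minor)
group_adv = tracker.advantage("minority", r_minor)
# -> minority samples are scored against their own group's running statistics
#    instead (note the EMA warm-up bias early in training).
```

One consequence of history-based normalization is that the advantage scale for a group stays stable even when the number of samples that group contributes to a batch fluctuates, which is consistent with the decoupling the paper describes.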