Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

arXiv cs.LG / 5/4/2026


Key Points

  • The paper frames RLHF as a decision problem under objective misspecification, where the learned proxy reward can diverge from true human utility and cause Goodharting through reward over-optimization.
  • It proposes Wasserstein distributionally robust regret optimization (DRRO), which pessimizes the worst-case regret relative to the best policy under the same plausible reward perturbation, rather than pessimizing the worst-case value (see the schematic objectives after this list).
  • The authors analyze a promptwise formulation (via a simplex allocation model) and prove that, under an ℓ1 ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure.
  • They derive a practical policy-gradient algorithm that can be integrated with PPO/GRPO-style RLHF training using only minor modifications, supported by experiments showing improved mitigation of over-optimization versus existing baselines.
  • The framework provides a theoretical explanation for why DRRO is less pessimistic than standard DRO while achieving better empirical robustness to proxy-reward overfitting.
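
As a rough schematic of this contrast (notation introduced here for illustration, not taken verbatim from the paper), write V_r(π) for the value of policy π under reward r and U(r̂) for the set of plausible rewards around the learned proxy r̂:

```latex
% Schematic only: V_r(\pi) is the value of policy \pi under reward r,
% and U(\hat r) is the ambiguity set of plausible rewards around the proxy \hat r.
% Standard DRO guards the worst-case value:
\max_{\pi} \; \min_{r \in U(\hat r)} \; V_r(\pi)
% DRRO guards the worst-case regret against the best policy
% under the same reward perturbation:
\max_{\pi} \; \min_{r \in U(\hat r)} \; \Big[\, V_r(\pi) - \max_{\pi'} V_r(\pi') \,\Big]
```

One intuition for the "less pessimistic" claim is that the benchmark max over π' moves with the perturbed reward, so uniformly bad perturbations hurt the benchmark as well and buy the adversary little extra regret.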

Abstract

Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where the proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing the worst-case value as in standard DRO, DRRO pessimizes the worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an ℓ1 ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to PPO/GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines, while standard DRO is systematically over-pessimistic.
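
The "water-filling" structure mentioned above is a classical allocation pattern: mass is poured onto the most favorable coordinates until a common water level is reached, and the remaining coordinates stay at zero. The sketch below solves the textbook instance (maximize Σ log(nᵢ + pᵢ) under a budget) by bisecting on the water level; it only illustrates the structural form and does not reproduce the paper's promptwise solution.

```python
import numpy as np

def water_fill(noise, budget, iters=100):
    """Classical water-filling: maximize sum(log(noise + p)) subject to
    sum(p) == budget and p >= 0.  The optimum is p_i = max(0, level - noise_i),
    with the common water level found here by bisection.  Shown only to
    illustrate the structural form the paper refers to."""
    noise = np.asarray(noise, dtype=float)
    lo, hi = noise.min(), noise.max() + budget
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        spent = np.maximum(0.0, level - noise).sum()
        if spent > budget:
            hi = level   # water level too high: allocation exceeds the budget
        else:
            lo = level   # water level too low: budget not yet exhausted
    return np.maximum(0.0, level - noise)

# Example: three coordinates with different baseline levels and a unit budget.
print(water_fill([0.1, 0.5, 1.0], budget=1.0))
```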
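
The abstract describes the resulting method as a policy-gradient update with a "sampled-bonus interpretation" and only minor changes to PPO/GRPO-style training, but does not spell out the bonus itself. The snippet below is therefore a hypothetical sketch of how a worst-case-regret score could replace the raw proxy reward before GRPO-style group normalization; the finite set of candidate reward functions is a crude stand-in for the paper's Wasserstein/ℓ1 ambiguity set, and all names and shapes here are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np

def worst_case_regret_scores(reward_matrix):
    """Hypothetical DRRO-style scoring for one prompt.

    reward_matrix has shape (num_reward_candidates, num_sampled_responses):
    row k holds the scores candidate reward function k assigns to the sampled
    responses.  A finite candidate set is a crude stand-in for the paper's
    Wasserstein/l1 ambiguity ball, used only to keep the example self-contained.
    """
    R = np.asarray(reward_matrix, dtype=float)
    # Regret of each response under each candidate reward: how far it falls
    # short of the best sampled response under that same reward.
    regret = R.max(axis=1, keepdims=True) - R
    # DRRO-style score: the negative of the worst-case regret per response.
    scores = -regret.max(axis=0)
    # GRPO-style normalization within the sampled group.
    return (scores - scores.mean()) / (scores.std() + 1e-8)

# Example: two candidate reward functions scoring four sampled responses.
proxy = [1.2, 0.4, 0.9, 0.1]       # learned proxy reward
perturbed = [0.8, 0.6, 1.1, 0.2]   # one plausible perturbation of it
print(worst_case_regret_scores([proxy, perturbed]))
```

Scores of this kind could be dropped into a standard GRPO/PPO advantage computation in place of the raw proxy rewards, which is consistent with the "only minor modifications" framing, even though the paper's actual bonus form may differ.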