Online Distributionally Robust LLM Alignment via Regression to Relative Reward

arXiv stat.ML / 4/20/2026


Key Points

  • The paper targets RLHF overoptimization in offline settings, where models can overfit training inaccuracies and drift from preferred behaviors, motivating distributionally robust optimization (DRO) for alignment.
  • It introduces DRO-REBEL, an online DRO-based variant of REBEL using type-p Wasserstein, KL, and chi-squared ambiguity sets, reformulating each update as a relative-reward regression through strong duality.
  • Unlike prior DRO-DPO approaches, DRO-REBEL avoids sample inefficiency, heterogeneous-preference neglect, and brittle heuristics, and it does not rely on PPO-style clipping or value networks.
  • The authors provide theoretical convergence/error bounds—\widetilde{O}(\sqrt{d/n}) under stated assumptions and an improved \widetilde{O}(d/n) parametric rate under preference shift—alongside divergence-specific, tractable SGD algorithms.
  • Experiments on Emotion Alignment, the ArmoRM multi-objective benchmark, and HH-Alignment show DRO-REBEL outperforming both robust and non-robust baselines across unseen preference mixtures, model sizes, and dataset scales.
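Under the linear-reward and log-linear-policy assumptions noted above, the REBEL-style update the paper builds on reduces to a least-squares regression of policy log-ratio differences onto relative rewards. A minimal sketch of that regression loss (all names and the feature parameterization are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def rebel_regression_loss(theta, feats_y, feats_yp, logp_old_y, logp_old_yp,
                          rewards_y, rewards_yp, eta):
    """Squared error of the relative-reward regression behind a REBEL update.

    Assumes a log-linear policy, log pi_theta(y|x) proportional to
    theta @ phi(x, y), so log-ratios to the previous policy are linear in
    theta and the partition function cancels across the response pair.
    Hypothetical sketch -- names are not taken from the paper's code.
    """
    # log pi_theta - log pi_old for each response in the pair
    logratio_y = feats_y @ theta - logp_old_y
    logratio_yp = feats_yp @ theta - logp_old_yp
    # predicted relative reward (scaled by 1/eta) vs. observed relative reward
    pred = (logratio_y - logratio_yp) / eta
    target = rewards_y - rewards_yp
    return np.mean((pred - target) ** 2)
```

Minimizing this loss with SGD over sampled response pairs recovers the scalable, clipping-free update that the robust variants then reweight per ambiguity set.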

Abstract

Reinforcement Learning with Human Feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where language models degrade by overfitting inaccuracies and drifting from preferred behaviors observed during training. Distributionally robust optimization (DRO) is a natural solution, but existing DRO-DPO methods are sample-inefficient, ignore heterogeneous preferences, and lean on brittle heuristics. We introduce \emph{DRO-REBEL}, a family of robust online REBEL updates built on type-p Wasserstein, Kullback-Leibler (KL), and \chi^2 ambiguity sets. Strong duality reduces each update to a relative-reward regression, retaining REBEL's scalability without PPO-style clipping or value networks. Under linear rewards, log-linear policies, and a standard coverage condition, we prove \widetilde{O}(\sqrt{d/n}) bounds on squared parameter error, with sharper constants than prior DRO-DPO analyses, and give the first parametric \widetilde{O}(d/n) rate for DRO-based alignment under preference shift, matching non-robust RLHF in benign regimes. Each divergence yields a tractable SGD-based algorithm: gradient regularization for Wasserstein, importance weighting for KL, and a 1-D dual solve for \chi^2. On Emotion Alignment, the ArmoRM multi-objective benchmark, and HH-Alignment, DRO-REBEL outperforms prior robust and non-robust baselines across unseen preference mixtures, model sizes, and dataset scales.
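For the KL ambiguity set, the importance-weighting scheme mentioned in the abstract follows from the standard Donsker-Varadhan dual, sup over Q with KL(Q||P) <= rho of E_Q[loss] = inf over tau > 0 of tau*rho + tau*log E_P[exp(loss/tau)], where SGD gradients are reweighted by exponentially tilted per-sample weights. A generic sketch of that dual (a standard construction under this KL-ball assumption, not the paper's exact formulation):

```python
import numpy as np

def kl_dro_objective(losses, tau, rho):
    """Donsker-Varadhan dual of KL-ball DRO: tau*rho + tau*log E[exp(loss/tau)].

    Any tau > 0 gives an upper bound on the worst-case expected loss over
    distributions within KL radius rho of the empirical distribution;
    minimizing over tau tightens it. Generic sketch, not the paper's code.
    """
    m = losses.max()  # shift for a numerically stable log-sum-exp
    log_mean_exp = m / tau + np.log(np.mean(np.exp((losses - m) / tau)))
    return tau * rho + tau * log_mean_exp

def kl_importance_weights(losses, tau):
    """Normalized weights proportional to exp(loss/tau): the exponential
    tilting that skews SGD gradients toward high-loss (worst-case) samples."""
    w = np.exp((losses - losses.max()) / tau)
    return w / w.sum()
```

As tau grows the weights flatten toward uniform (recovering the non-robust objective); small tau concentrates weight on the hardest examples, which is how the robust update guards against preference shift.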