Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

arXiv cs.LG / 2026/3/24


Key Points

  • The paper analyzes episodic reinforcement learning from human feedback (RLHF) when preference labels come from multiple sources (annotators, experts, reward models, heuristics) that may systematically deviate from an ideal single objective.
  • It introduces a cumulative “imperfection budget” framework: for each feedback source, the total deviation of its preference probabilities from an ideal oracle is bounded by \omega over K episodes.
  • The authors propose a unified algorithm with regret \tilde{O}(\sqrt{K/M}+\omega), where M is the number of sources, showing “best-of-both-regimes” behavior: strong M-dependent statistical gains when imperfections are small, and robustness, at the cost of an unavoidable additive \omega term, when imperfections are large.
  • They provide a matching lower bound \tilde{\Omega}(\max\{\sqrt{K/M},\omega\}) and show that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as \tilde{\Omega}(\min\{\omega\sqrt{K},K\}).
  • The method relies on imperfection-adaptive weighted comparison learning, value-targeted transition estimation to manage distribution shift from feedback mismatch, and sub-importance sampling to keep training objectives analyzable.
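The paper's exact estimator is not reproduced here, but the idea of imperfection-adaptive weighted comparison learning can be sketched under a standard Bradley-Terry preference model: each pairwise label contributes a logistic comparison loss, down-weighted by how imperfect its source is believed to be. The weighting rule below (`1 / (1 + omega_hat)`) is a generic placeholder, not the paper's actual scheme:

```python
import math

def weighted_bt_loss(score_diffs, labels, weights):
    """Weighted Bradley-Terry negative log-likelihood.

    score_diffs[i]: estimated score gap r(tau_1) - r(tau_0) for pair i
    labels[i]:      1 if the source preferred tau_1, else 0
    weights[i]:     trust weight for the source that labeled pair i
    """
    total = 0.0
    for d, y, w in zip(score_diffs, labels, weights):
        p = 1.0 / (1.0 + math.exp(-d))  # P(tau_1 preferred | score gap d)
        total += -w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / sum(weights)

def source_weight(omega_hat):
    """Placeholder trust weight: shrink as estimated imperfection grows."""
    return 1.0 / (1.0 + omega_hat)
```

A label from a source with large estimated cumulative imperfection then moves the learned comparison model less than a label from a near-oracle source.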

Abstract

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most \omega over K episodes. We propose a unified algorithm with regret \tilde{O}(\sqrt{K/M}+\omega), which exhibits a best-of-both-regimes behavior: it achieves M-dependent statistical gains when imperfection is small (where M is the number of sources), while remaining robust with unavoidable additive dependence on \omega when imperfection is large. We complement this with a lower bound \tilde{\Omega}(\max\{\sqrt{K/M},\omega\}), which captures the best possible improvement with respect to M and the unavoidable dependence on \omega, and a counterexample showing that na\"ively treating imperfect feedback as oracle-consistent can incur regret as large as \tilde{\Omega}(\min\{\omega\sqrt{K},K\}). Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
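A rough numeric illustration of the two regimes, using the bounds stated in the abstract with all constants and logarithmic factors hidden by \tilde{O}(\cdot) dropped (so the numbers are only indicative of scaling, not actual regret values):

```python
import math

def unified_regret(K, M, omega):
    """Scaling of the paper's upper bound: sqrt(K/M) + omega."""
    return math.sqrt(K / M) + omega

def naive_regret(K, omega):
    """Scaling of the counterexample bound for oracle-consistent
    treatment of imperfect feedback: min(omega * sqrt(K), K)."""
    return min(omega * math.sqrt(K), K)

K, M = 10_000, 4
# Small-imperfection regime (omega = 1):
#   unified ~ sqrt(10000/4) + 1 = 51, dominated by the sqrt(K/M) gain.
# Large-imperfection regime (omega = 50):
#   unified ~ 50 + 50 = 100, while the naive scaling is 50 * 100 = 5000.
small = (unified_regret(K, M, 1.0), naive_regret(K, 1.0))
large = (unified_regret(K, M, 50.0), naive_regret(K, 50.0))
```

The gap in the large-imperfection regime is what the counterexample formalizes: without accounting for the imperfection budget, regret can grow like \omega\sqrt{K} rather than \sqrt{K/M}+\omega.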