Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

arXiv cs.LG / 2026/3/24


Key Points

  • The paper analyzes episodic reinforcement learning from human feedback (RLHF) when preference labels come from multiple sources (annotators, experts, reward models, heuristics) that may systematically deviate from an ideal single objective.
  • It introduces a cumulative “imperfection budget” framework: for each feedback source, the total deviation of its preference probabilities from an ideal oracle is bounded by \omega over K episodes.
  • The authors propose a unified algorithm with regret \tilde{O}(\sqrt{K/M}+\omega), where M is the number of sources, showing “best-of-both-regimes” behavior: strong M-dependent statistical gains when imperfections are small, and robustness, at the cost of an unavoidable additive \omega term, when imperfections are large.
  • They provide a matching lower bound \tilde{\Omega}(\max\{\sqrt{K/M},\omega\}) and show that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as \tilde{\Omega}(\min\{\omega\sqrt{K},K\}).
  • The method relies on imperfection-adaptive weighted comparison learning, value-targeted transition estimation to manage distribution shift from feedback mismatch, and sub-importance sampling to keep training objectives analyzable.
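The paper's exact estimator is not reproduced here, but the idea of imperfection-adaptive weighted comparison learning can be sketched under a standard Bradley-Terry preference model: each pairwise label contributes a logistic comparison loss, down-weighted by how imperfect its source is believed to be. The weighting rule below (`1 / (1 + omega_hat)`) is a generic placeholder, not the paper's actual scheme:

```python
import math

def weighted_bt_loss(score_diffs, labels, weights):
    """Weighted Bradley-Terry negative log-likelihood.

    score_diffs[i]: estimated score gap r(tau_1) - r(tau_0) for pair i
    labels[i]:      1 if the source preferred tau_1, else 0
    weights[i]:     trust weight for the source that labeled pair i
    """
    total = 0.0
    for d, y, w in zip(score_diffs, labels, weights):
        p = 1.0 / (1.0 + math.exp(-d))  # P(tau_1 preferred | score gap d)
        total += -w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / sum(weights)

def source_weight(omega_hat):
    """Placeholder trust weight: shrink as estimated imperfection grows."""
    return 1.0 / (1.0 + omega_hat)
```

A label from a source with large estimated cumulative imperfection then moves the learned comparison model less than a label from a near-oracle source.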

Abstract

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most \omega over K episodes. We propose a unified algorithm with regret \tilde{O}(\sqrt{K/M}+\omega), which exhibits a best-of-both-regimes behavior: it achieves M-dependent statistical gains when imperfection is small (where M is the number of sources), while remaining robust with unavoidable additive dependence on \omega when imperfection is large. We complement this with a lower bound \tilde{\Omega}(\max\{\sqrt{K/M},\omega\}), which captures the best possible improvement with respect to M and the unavoidable dependence on \omega, and a counterexample showing that na\"ively treating imperfect feedback as oracle-consistent can incur regret as large as \tilde{\Omega}(\min\{\omega\sqrt{K},K\}). Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
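A rough numeric illustration of the two regimes, using the bounds stated in the abstract with all constants and logarithmic factors hidden by \tilde{O}(\cdot) dropped (so the numbers are only indicative of scaling, not actual regret values):

```python
import math

def unified_regret(K, M, omega):
    """Scaling of the paper's upper bound: sqrt(K/M) + omega."""
    return math.sqrt(K / M) + omega

def naive_regret(K, omega):
    """Scaling of the counterexample bound for oracle-consistent
    treatment of imperfect feedback: min(omega * sqrt(K), K)."""
    return min(omega * math.sqrt(K), K)

K, M = 10_000, 4
# Small-imperfection regime (omega = 1):
#   unified ~ sqrt(10000/4) + 1 = 51, dominated by the sqrt(K/M) gain.
# Large-imperfection regime (omega = 50):
#   unified ~ 50 + 50 = 100, while the naive scaling is 50 * 100 = 5000.
small = (unified_regret(K, M, 1.0), naive_regret(K, 1.0))
large = (unified_regret(K, M, 50.0), naive_regret(K, 50.0))
```

The gap in the large-imperfection regime is what the counterexample formalizes: without accounting for the imperfection budget, regret can grow like \omega\sqrt{K} rather than \sqrt{K/M}+\omega.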