Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
arXiv cs.LG / 2026/4/6
Key points
- The paper addresses reward hacking in RLHF, where reinforcement learning against a learned reward model can cause true response quality to plateau or degrade even as the proxy reward rises.
- It argues that a key failure mode is "flipped advantage signs": an incorrectly signed advantage estimate causes policy updates to increase the likelihood of bad responses.
- By applying adversarial perturbations in the reward model's parameter space, the authors derive a certified sign-preservation radius, i.e., the minimum parameter perturbation needed to flip a completion's advantage sign.
- They introduce Sign-Certified Policy Optimization (SignCert-PO), which down-weights policy-gradient contributions from non-robust (sign-unstable) completions.
- Experiments on TL;DR summarization and the AlpacaFarm benchmark show improved win rates over baselines and reduced reward hacking; the method requires only the RM parameters and on-policy completions at optimization time.
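To make the core idea concrete, here is a minimal sketch of sign-robustness-weighted policy updates on a toy linear reward model. All names (`reward`, `advantages`, `sign_stability_weights`) and the setup are hypothetical, and the certified radius from the paper is replaced here by a Monte Carlo proxy: the fraction of random parameter perturbations of a fixed norm under which each completion's advantage keeps its sign.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(theta, feats):
    """Toy linear reward model: r = theta . phi(x, y)."""
    return feats @ theta

def advantages(theta, feats):
    """Advantages with a group-mean baseline (RLOO/GRPO-style)."""
    r = reward(theta, feats)
    return r - r.mean()

def sign_stability_weights(theta, feats, radius, n_samples=256):
    """Monte Carlo proxy for sign certification: the fraction of
    random RM-parameter perturbations of norm `radius` under which
    each completion's advantage keeps its unperturbed sign.
    (The paper derives a certified radius analytically; this is
    only an illustrative stand-in.)"""
    adv0 = advantages(theta, feats)
    keep = np.zeros(len(feats))
    for _ in range(n_samples):
        delta = rng.normal(size=theta.shape)
        delta *= radius / np.linalg.norm(delta)  # project onto the radius sphere
        adv = advantages(theta + delta, feats)
        keep += (np.sign(adv) == np.sign(adv0))
    return keep / n_samples

# Hypothetical batch: 4 on-policy completions with 3-dim reward features.
theta = np.array([1.0, -0.5, 0.2])
feats = rng.normal(size=(4, 3))

w = sign_stability_weights(theta, feats, radius=0.3)
adv = advantages(theta, feats)
grad_logp = rng.normal(size=4)       # stand-in for per-completion grad log-probs
update = w * adv * grad_logp         # sign-unstable completions contribute less
```

The down-weighting is the key step: completions whose advantage sign is fragile under small RM perturbations (low `w`) contribute little to the policy gradient, so an erroneously flipped sign cannot push the policy hard toward a bad response.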