Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling

arXiv stat.ML / 3/25/2026


Key Points

  • The paper addresses how to perform privacy-preserving RLHF when human preference data may contain sensitive user information by applying differential privacy specifically to the reward-learning stage rather than the entire pipeline.
  • It proposes deriving the final policy from a privately learned reward model, aligning the method with the distinct structure of reinforcement learning from human feedback.
  • The authors provide theoretical analyses including bounds on the suboptimality gap, showing that privacy adds an additional term beyond standard (non-private) statistical error.
  • They also prove minimax lower bounds and identify how the dominant error term changes depending on sample size and privacy level, yielding regimes where the proposed upper bound is rate-optimal up to logarithmic factors.
  • Experiments on synthetic data and on the Anthropic HH-RLHF dataset with Gemma-2B-IT show improved private alignment performance over existing differentially private baselines across privacy budgets.
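The decoupled recipe above can be illustrated with a minimal numpy sketch: fit a linear Bradley-Terry reward model on preference pairs with DP-SGD-style per-example gradient clipping plus Gaussian noise, then derive a policy from the private reward model by post-processing (which incurs no additional privacy cost). This is an illustrative toy, not the paper's implementation; all function names, the linear reward parameterization, and the hyperparameters are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic preference data: feature vectors for (chosen, rejected) pairs.
d, n = 8, 512
theta_true = rng.normal(size=d)
chosen = rng.normal(size=(n, d))
rejected = rng.normal(size=(n, d))
# Relabel so "chosen" truly has the higher underlying reward.
swap = chosen @ theta_true < rejected @ theta_true
chosen[swap], rejected[swap] = rejected[swap].copy(), chosen[swap].copy()

def dp_sgd_reward(chosen, rejected, clip=1.0, noise_mult=1.0,
                  lr=0.1, epochs=20, batch=64, seed=1):
    """Private reward learning only: DP-SGD on the Bradley-Terry
    log-loss, with per-example gradient clipping and Gaussian noise."""
    rng = np.random.default_rng(seed)
    n, d = chosen.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for s in range(0, n, batch):
            b = idx[s:s + batch]
            diff = chosen[b] - rejected[b]           # (B, d)
            p = sigmoid(diff @ theta)                # P(chosen preferred)
            g = -(1.0 - p)[:, None] * diff           # per-example gradients
            norms = np.linalg.norm(g, axis=1, keepdims=True)
            g = g * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
            noise = rng.normal(scale=noise_mult * clip, size=d)
            theta -= lr * (g.sum(axis=0) + noise) / len(b)
    return theta

theta_hat = dp_sgd_reward(chosen, rejected)

def policy(theta, candidates, beta=2.0):
    """Post-processing: a softmax policy over candidate responses scored
    by the private reward model; no further privacy budget is spent."""
    scores = beta * (candidates @ theta)
    scores -= scores.max()                           # numerical stability
    w = np.exp(scores)
    return w / w.sum()

cands = rng.normal(size=(5, d))
probs = policy(theta_hat, cands)
```

Because differential privacy is closed under post-processing, any policy computed purely from `theta_hat` inherits the reward model's privacy guarantee, which is the structural point the paper exploits.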

Abstract

Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. Theoretically, we study the suboptimality gap and show that privacy contributes an additional additive term beyond the usual non-private statistical error. We also establish a minimax lower bound and show that the dominant term changes with sample size and privacy level, which in turn characterizes regimes in which the upper bound is rate-optimal up to logarithmic factors. Empirically, synthetic experiments confirm the scaling predicted by the theory, and experiments on the Anthropic HH-RLHF dataset using the Gemma-2B-IT model show stronger private alignment performance than existing differentially private baseline methods across privacy budgets.
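The bound structure described in the abstract can be written schematically as follows; the symbols here are generic placeholders, not the paper's exact rates or constants:

\[
\mathrm{SubOpt}(\hat{\pi}) \;\lesssim\; \underbrace{\mathrm{err}_{\mathrm{stat}}(n)}_{\text{non-private statistical error}} \;+\; \underbrace{\mathrm{err}_{\mathrm{priv}}(n,\varepsilon)}_{\text{additive cost of privacy}},
\]

where \(n\) is the number of preference samples and \(\varepsilon\) the privacy budget. Which of the two terms dominates depends on the \((n,\varepsilon)\) regime, and the matching minimax lower bound identifies the regimes where this upper bound is rate-optimal up to logarithmic factors.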