Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
arXiv cs.LG · April 22, 2026
Key Points
- The paper frames Safe RLHF as an infinite-horizon discounted constrained Markov decision process (CMDP), reflecting that humans may provide feedback during ongoing, continuing interactions rather than within a single finite episode (a generic form of this objective is sketched after this list).
- It proposes two new Safe RLHF algorithms that bypass reward-model fitting and instead operate directly on the CMDP formulation, while allowing flexible trajectory lengths during training.
- The methods follow a primal-dual optimization approach and come with global convergence guarantees, rather than relying only on empirical validation (a toy illustration of the primal-dual template follows the list).
- The convergence rates are polynomial in the number of policy-gradient iterations, the sampled trajectory lengths, and the number of human preference queries.
- The authors claim this is the first study of infinite-horizon discounted CMDP settings under human feedback with global, non-asymptotic convergence guarantees.
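
For context, a generic infinite-horizon discounted CMDP objective (standard notation rather than the paper's own symbols; the constraint direction and threshold $b$ here are assumptions) can be written as:

```latex
% Maximize discounted reward subject to a discounted safety-utility
% constraint: r is the reward, g the constraint utility, b the threshold.
\max_{\pi}\; V_r(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
V_g(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, g(s_t, a_t)\right] \ge b
```

A primal-dual method attaches a multiplier $\lambda \ge 0$ to the constraint and works with the Lagrangian $L(\pi, \lambda) = V_r(\pi) + \lambda\,(V_g(\pi) - b)$, ascending in the policy and descending in $\lambda$.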

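As a deliberately toy illustration of that template (not the authors' algorithm), here is a minimal REINFORCE-style Lagrangian loop on a made-up two-state CMDP; the environment, step sizes, and the geometric-horizon trick for sampling variable-length trajectories are all assumptions for illustration:

```python
import numpy as np

# Toy tabular CMDP (all dynamics, rewards, and utilities are made up).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s']: transition probs
              [[0.7, 0.3], [0.4, 0.6]]])
R = np.array([[1.0, 0.2], [0.0, 0.8]])        # reward r(s, a)
G = np.array([[0.1, 0.9], [0.9, 0.3]])        # safety utility g(s, a)
gamma, b = 0.9, 3.0                           # discount factor, threshold
theta = np.zeros((2, 2))                      # softmax policy logits
lam = 0.0                                     # Lagrange multiplier (dual var)
eta_theta, eta_lam = 0.05, 0.05               # primal / dual step sizes
rng = np.random.default_rng(0)

def policy(s):
    """Softmax policy over the two actions in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

for it in range(2000):
    # Sample one trajectory; a geometric horizon gives variable lengths
    # whose mean matches the effective horizon 1 / (1 - gamma).
    H = rng.geometric(1 - gamma)
    s, traj, Vr, Vg = 0, [], 0.0, 0.0
    for t in range(H):
        a = rng.choice(2, p=policy(s))
        traj.append((s, a))
        Vr += gamma**t * R[s, a]
        Vg += gamma**t * G[s, a]
        s = rng.choice(2, p=P[s, a])
    # REINFORCE-style ascent on the Lagrangian L = Vr + lam * (Vg - b).
    ret = Vr + lam * (Vg - b)
    grad = np.zeros_like(theta)
    for s_t, a_t in traj:
        grad[s_t] -= policy(s_t)               # softmax score: e_a - pi(.|s)
        grad[s_t, a_t] += 1.0
    theta += eta_theta * ret * grad            # primal ascent in the policy
    lam = max(0.0, lam - eta_lam * (Vg - b))   # projected dual descent
```

The dual update raises $\lambda$ when the safety return falls below the threshold (penalizing unsafe behavior more) and lets it decay toward zero when the constraint is slack; per the key points above, the paper proves global, non-asymptotic convergence for primal-dual algorithms of this general flavor under human preference feedback.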