Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
arXiv cs.CL · April 30, 2026
Key Points
- The paper presents Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method to make LLMs persuasive yet safe when acting as business development agents for price negotiation in online travel agencies.
- REPO combines multiple heterogeneous reward signals: a preference-trained reward model, an LLM-as-a-judge for nuanced criteria like emotional value and SOP compliance, and rule-based (mostly regex) rewards for deterministic guardrails including numerics, formatting, and hallucination avoidance.
- In human expert evaluations covering real multi-turn conversations and curated failure cases, REPO achieves higher dialogue quality, including a large increase in the share of conversations with at least one excellent response.
- In a production A/B test over 9,653 real customer conversations, REPO outperforms an intent-driven dialogue system by improving both response rate and task success rate with statistical significance.
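The heterogeneous-reward combination described above can be sketched as a weighted blend of the three signal types. The function names, weights, and regex guardrails below are illustrative assumptions, not details from the paper:

```python
import re

def rule_reward(response: str, quoted_price: float) -> float:
    """Deterministic rule-based guardrails (hypothetical examples).

    The paper uses mostly regex rules for numerics, formatting, and
    hallucination avoidance; the two checks here are stand-ins.
    """
    penalty = 0.0
    # Guardrail 1: any price mentioned must match the authorized quote,
    # otherwise treat it as a numeric hallucination.
    for price in re.findall(r"\$(\d+(?:\.\d+)?)", response):
        if float(price) != quoted_price:
            penalty += 1.0
    # Guardrail 2: discourage overly long, unformatted replies.
    if len(response) > 500:
        penalty += 0.5
    return -penalty

def combined_reward(rm_score: float, judge_score: float,
                    response: str, quoted_price: float,
                    w_rm: float = 0.5, w_judge: float = 0.3,
                    w_rule: float = 0.2) -> float:
    """Blend the preference reward model, the LLM-as-a-judge score,
    and the deterministic rule reward into one scalar for RL training.
    The weights here are assumed, not reported values."""
    return (w_rm * rm_score
            + w_judge * judge_score
            + w_rule * rule_reward(response, quoted_price))
```

For example, a response quoting the authorized $120 price keeps its full blended score, while the same response quoting an unauthorized $99 is penalized by the rule term.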