Teaching LLMs to Be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

arXiv cs.CL · April 30, 2026


Key Points

  • The paper presents Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method to make LLMs persuasive yet safe when acting as business development agents for price negotiation in online travel agencies.
  • REPO combines multiple heterogeneous reward signals: a preference-trained reward model, an LLM-as-a-judge for nuanced criteria such as emotional value and SOP compliance, and rule-based (mostly regex) rewards for deterministic guardrails covering numerics, formatting, and hallucination avoidance (a hedged sketch of this blend appears after this list).
  • In human expert evaluations covering real multi-turn conversations and curated failure cases, REPO achieves higher dialogue quality, raising the share of conversations with at least one excellent response to 66.67%, a 23.34-percentage-point gain over GRPO.
  • In a production A/B test over 9,653 real customer conversations, REPO outperforms an intent-driven dialogue system, improving response rate by 12.14 percentage points and task success rate by 5.94 percentage points (p < 0.001).
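The paper does not release code, so the following Python sketch is only an illustration of how such a heterogeneous blend might look. The function names (`rule_based_reward`, `combined_reward`), the regex patterns, and the 0.5/0.3/0.2 weights are all assumptions for the example, not details from REPO.

```python
import re

def rule_based_reward(response: str) -> float:
    """RF: deterministic regex guardrails. Patterns and scores are illustrative."""
    score = 0.0
    # Guardrail: penalize over-promising language the SOP would forbid.
    if re.search(r"\b(guarantee|promise|100% refund)\b", response, re.IGNORECASE):
        score -= 1.0
    # Numerics/formatting: reward a well-formed price mention, e.g. "$120.00".
    if re.search(r"\$\d+(\.\d{2})?\b", response):
        score += 0.5
    return score

def combined_reward(rm_score: float, judge_score: float, response: str) -> float:
    """Blend the three heterogeneous signals into one scalar for RL training.

    rm_score:    preference-trained reward model (RM) output
    judge_score: LLM-as-a-judge (RJ) rating for nuanced criteria
    The 0.5/0.3/0.2 weights are placeholders, not values from the paper.
    """
    return 0.5 * rm_score + 0.3 * judge_score + 0.2 * rule_based_reward(response)

# Example: a compliant quote scores higher than an over-promising one.
print(combined_reward(0.8, 0.7, "We can offer $120.00 for this room tonight."))
print(combined_reward(0.8, 0.7, "I promise a 100% refund, guaranteed."))
```

The appeal of this shape is that the learned signals (RM, RJ) handle fuzzy qualities like persuasiveness and tone, while the regex layer stays deterministic and auditable, so a guardrail violation can never be outvoted by a high style score if its penalty is weighted strongly enough.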

Abstract

We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guardrails (no over-promising and no hallucinations), while remaining human-like and effective over long, multi-turn dialogues. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method that combines heterogeneous rewards: a preference-trained reward model (RM), an LLM-as-a-judge (RJ) for nuanced behaviors (e.g., emotional value and SOP compliance), and rule-based reward functions (RF), mainly regex-based, for deterministic checks on numerics, formatting, and guardrails. In expert consensus evaluation (three human experts; 30 online conversations and 45 curated bad cases), REPO improves average dialogue rating to 4.63 (+0.33 over GRPO) and raises the share of conversations with at least one excellent response to 66.67% (+23.34 pp over GRPO), while achieving a 93.33% bad-case fix rate with 75.56% clean fixes. In a production A/B test on 9,653 real customer conversations (vs. an intent-driven dialogue system), REPO improves response rate by +12.14 pp and task success rate by +5.94 pp (p<0.001).
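For context on the baseline: GRPO, against which REPO is compared, scores a group of sampled responses per prompt and standardizes each reward within that group. The sketch below shows only that group-relative advantage step, assuming each response has already been scored by a blended scalar reward like the one above; it is background on the baseline, not the paper's code.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style step: standardize each sampled response's reward
    against the mean/std of its prompt's sample group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Four responses to one prompt, scored by the heterogeneous reward blend.
print(group_relative_advantages([0.71, 0.41, 0.65, 0.30]))
```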