Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

arXiv cs.AI / 4/14/2026


Key Points

  • The paper studies whether Reinforcement Learning with Verifiable Rewards (RLVR) can teach LLM-based agents to negotiate in incomplete-information settings such as bilateral price bargaining.
  • It presents a training framework where a mid-sized buyer agent negotiates against a regulated seller LLM across a broad set of real-world products, with rewards grounded in economic surplus maximization and enforced private budget constraints.
  • The authors report a four-phase strategic evolution during training, progressing from naive bargaining to aggressive opening bids, through deadlock behaviors, and finally to advanced persuasive tactics.
  • Results indicate the trained ~30B buyer agent substantially outperforms frontier models more than ten times its size in extracting surplus, while generalizing to stronger, previously unseen counterparties, including hostile adversarial seller personas.
  • The work suggests that verifiable reward design can meaningfully improve LLM negotiation competence and robustness beyond what standard prompting or non-verifiable training might achieve.
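The core of the verifiable reward design described above is that the buyer's reward can be checked mechanically from the negotiation outcome: economic surplus when a deal closes within budget, and a penalty when the private budget constraint is violated. A minimal sketch of such a reward function is below; the function name, normalization, and penalty value are illustrative assumptions, not the paper's exact shaping.

```python
def buyer_reward(agreed_price: float, budget: float, deal_reached: bool) -> float:
    """Illustrative verifiable reward for a buyer agent in price bargaining.

    Assumptions (not from the paper): no deal yields zero reward, exceeding
    the private budget yields a fixed penalty, and surplus is normalized by
    the budget so rewards are comparable across products of different prices.
    """
    if not deal_reached:
        return 0.0  # walking away earns nothing
    if agreed_price > budget:
        return -1.0  # hard penalty: private budget constraint violated
    # Economic surplus (budget minus price), normalized to [0, 1]
    return (budget - agreed_price) / budget
```

Because every term is computable from the transcript's final price and the agent's private budget, the signal needs no learned reward model, which is what makes it "verifiable" in the RLVR sense.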

Abstract

The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate whether Reinforcement Learning with Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models more than ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.