Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

arXiv cs.CL / 4/9/2026


Key Points

  • The paper introduces SalesLLM, a bilingual (ZH/EN) benchmark for evaluating LLMs on realistic, multi-turn sales dialogues with measurable deal progression and end outcomes.
  • SalesLLM is built from 30,074 scripted configurations and 1,805 curated scenarios with controllable difficulty, personas, and coverage across Financial Services and Consumer Goods.
  • The evaluation pipeline is fully automatic, using an LLM-based rater for sales-process progress and fine-tuned BERT classifiers to predict buying intent at the end of dialogues.
  • To improve simulation fidelity, the authors train a customer behavior model (CustomerLM) with SFT and DPO, reducing role inversion from 17.44% (GPT-4o) to 8.8%.
  • Results show strong correlation with expert human ratings (Pearson r=0.98) and significant performance variation across 15 mainstream LLMs, indicating the benchmark can help develop outcome-oriented sales agents.
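The two-stage evaluation described above (an LLM-based rater for sales-process progress plus a classifier for end-of-dialogue buying intent) can be sketched as a simple score aggregation. The function name, the 0–1 rating scale, and the equal weighting below are illustrative assumptions, not the paper's actual scoring formula.

```python
# Hypothetical aggregation of SalesLLM-style evaluation signals.
# The weighting scheme and names are illustrative assumptions; the
# paper's exact formula is not specified in this summary.

def composite_sales_score(progress_ratings, buy_probability, w_process=0.5):
    """Combine per-turn process ratings (0-1) from an LLM rater with the
    end-of-dialogue buying-intent probability from a classifier."""
    if not progress_ratings:
        raise ValueError("need at least one per-turn rating")
    process_score = sum(progress_ratings) / len(progress_ratings)
    return w_process * process_score + (1 - w_process) * buy_probability

# Example: steady progress across four turns, strong final intent.
score = composite_sales_score([0.2, 0.4, 0.6, 0.8], buy_probability=0.9)
print(round(score, 2))  # 0.7
```

Separating process progress from final intent lets the benchmark reward dialogues that advance the deal even when the simulated customer ultimately declines.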

Abstract

Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performing LLMs are competitive with human performance, while less capable ones fall below it. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.
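The reported agreement with expert raters (Pearson r=0.98) amounts to correlating automatic benchmark scores with human ratings across evaluated models. A minimal sketch of that computation, with made-up illustrative scores (not the paper's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Illustrative data: automatic benchmark score vs. expert rating per model.
auto_scores = [0.82, 0.61, 0.45, 0.90, 0.30]
human_scores = [0.80, 0.65, 0.40, 0.92, 0.28]
print(round(pearson_r(auto_scores, human_scores), 3))  # close to 1.0
```

A value near 1.0, as the paper reports, indicates the automatic pipeline ranks models almost identically to expert raters.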