Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

arXiv cs.CL / 4/9/2026


Key Points

  • The paper introduces SalesLLM, a bilingual (ZH/EN) benchmark for evaluating LLMs on realistic, multi-turn sales dialogues with measurable deal progression and end outcomes.
  • SalesLLM is built from 30,074 scripted configurations and 1,805 curated scenarios with controllable difficulty, personas, and coverage across Financial Services and Consumer Goods.
  • The evaluation pipeline is fully automatic, using an LLM-based rater for sales-process progress and fine-tuned BERT classifiers to predict buying intent at the end of dialogues.
  • To improve simulation fidelity, the authors train a customer behavior model (CustomerLM) with SFT and DPO, reducing role inversion from 17.44% (GPT-4o) to 8.8%.
  • Results show strong correlation with expert human ratings (Pearson r=0.98) and significant performance variation across 15 mainstream LLMs, indicating the benchmark can help develop outcome-oriented sales agents.
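The two-stage evaluation described above (an LLM-based rater for sales-process progress plus a classifier for end-of-dialogue buying intent) can be sketched as a simple score aggregation. The function name, the 0–1 rating scale, and the equal weighting below are illustrative assumptions, not the paper's actual scoring formula.

```python
# Hypothetical aggregation of SalesLLM-style evaluation signals.
# The weighting scheme and names are illustrative assumptions; the
# paper's exact formula is not specified in this summary.

def composite_sales_score(progress_ratings, buy_probability, w_process=0.5):
    """Combine per-turn process ratings (0-1) from an LLM rater with the
    end-of-dialogue buying-intent probability from a classifier."""
    if not progress_ratings:
        raise ValueError("need at least one per-turn rating")
    process_score = sum(progress_ratings) / len(progress_ratings)
    return w_process * process_score + (1 - w_process) * buy_probability

# Example: steady progress across four turns, strong final intent.
score = composite_sales_score([0.2, 0.4, 0.6, 0.8], buy_probability=0.9)
print(round(score, 2))  # 0.7
```

Separating process progress from final intent lets the benchmark reward dialogues that advance the deal even when the simulated customer ultimately declines.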

Abstract

Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performing LLMs are competitive with human performance, while less capable ones fall below it. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.
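The reported agreement with expert raters (Pearson r=0.98) amounts to correlating automatic benchmark scores with human ratings across evaluated models. A minimal sketch of that computation, with made-up illustrative scores (not the paper's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Illustrative data: automatic benchmark score vs. expert rating per model.
auto_scores = [0.82, 0.61, 0.45, 0.90, 0.30]
human_scores = [0.80, 0.65, 0.40, 0.92, 0.28]
print(round(pearson_r(auto_scores, human_scores), 3))  # close to 1.0
```

A value near 1.0, as the paper reports, indicates the automatic pipeline ranks models almost identically to expert raters.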