Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

arXiv cs.AI / 4/8/2026


Key Points

  • The paper introduces Market-Bench, a configurable multi-agent benchmark designed to test large language models on economic and trade tasks such as procurement and retailing.
  • In the procurement stage, LLMs join budget-constrained auctions to bid for limited inventory (see the auction sketch after this list); in the retail stage, they set prices and generate marketing slogans that are shown to buyers through a role-based attention mechanism.
  • Market-Bench records full interaction trajectories—including bids, prices, slogans, sales, and balance-sheet states—so evaluations can combine economic/operational outcomes with semantic scoring.
  • Experiments across 20 open- and closed-source LLM agents show substantial performance gaps and a “winner-take-most” dynamic: only a small fraction of agents consistently achieves capital appreciation, while many hover near break-even.
  • The authors position Market-Bench as a reproducible testbed for studying how LLMs behave and compete in simulated markets under constrained resources and competition.
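
The paper does not spell out its exact auction rules here, so the following Python sketch is only illustrative: a per-unit sealed-bid allocation in which the highest bidders win units until supply or their budgets run out. All names (`Retailer`, `run_procurement_auction`, the `bids` mapping) are hypothetical, not taken from the benchmark's code.

```python
from dataclasses import dataclass

@dataclass
class Retailer:
    name: str
    budget: float
    inventory: int = 0

def run_procurement_auction(retailers, supply, bids):
    """Allocate limited supply to the highest per-unit bidders,
    capped by each retailer's remaining budget."""
    # Rank bidders by per-unit bid, highest first.
    ranked = sorted(retailers, key=lambda r: bids[r.name], reverse=True)
    for r in ranked:
        if supply == 0:
            break
        price = bids[r.name]
        if price <= 0:
            continue  # a zero or negative bid wins nothing
        # A retailer can buy at most what its remaining budget covers.
        units = min(supply, int(r.budget // price))
        r.inventory += units
        r.budget -= units * price
        supply -= units
    return retailers

# Hypothetical two-agent round: gpt_x outbids llama_y and clears supply first.
agents = [Retailer("gpt_x", budget=100.0), Retailer("llama_y", budget=80.0)]
run_procurement_auction(agents, supply=50, bids={"gpt_x": 3.0, "llama_y": 2.5})
```

Under these assumed rules the budget constraint binds both sides of the trade-off the benchmark probes: bidding high secures inventory but leaves less capital for later rounds.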

Abstract

The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce **Market-Bench**, a comprehensive benchmark that evaluates the capabilities of LLMs on economically relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply-chain economic model in which LLMs act as retailer agents responsible for procuring and retailing merchandise. In the **procurement** stage, LLMs bid for limited inventory in budget-constrained auctions. In the **retail** stage, LLMs set retail prices, generate marketing slogans, and present them to buyers, who make purchases through a role-based attention mechanism. Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon, *i.e.*, only a small subset of LLM retailers consistently achieves capital appreciation, while many hover around the break-even point despite similar semantic-matching scores. Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.
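
The abstract names a "role-based attention mechanism" without defining it, so the sketch below invents a plausible scoring rule: each buyer role weighs price against how well a slogan matches its keywords. The role schema, weights, and function names are assumptions, not the paper's actual design.

```python
def buyer_choice(offers, role):
    """Pick the retailer whose (price, slogan) offer best matches a buyer role.

    `offers` maps retailer name -> (price, slogan); `role` carries a price
    sensitivity and a set of keywords. Both schemas are hypothetical.
    """
    def score(name):
        price, slogan = offers[name]
        # Count how many of the role's keywords appear in the slogan.
        keyword_hits = sum(kw in slogan.lower() for kw in role["keywords"])
        return keyword_hits - role["price_sensitivity"] * price
    return max(offers, key=score)

# A budget-minded buyer role favoring cheap, value-themed slogans.
role = {"price_sensitivity": 0.5, "keywords": {"cheap", "value", "deal"}}
offers = {
    "gpt_x":   (9.0, "Premium quality, premium feel"),
    "llama_y": (6.0, "Best value deal in town"),
}
print(buyer_choice(offers, role))  # -> "llama_y"
```

Under a rule like this, retail-stage performance depends jointly on pricing and slogan wording, which is consistent with the paper's use of both economic and semantic metrics.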
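Because Market-Bench logs full trajectories of bids, prices, slogans, sales, and balance-sheet states, per-agent outcome metrics can be computed offline from the log alone. Below is a minimal sketch of one such record plus a capital-appreciation metric; the field names and the exact metric definition are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class RoundRecord:
    """One round of a hypothetical Market-Bench trajectory log."""
    round_id: int
    bids: dict[str, float]      # retailer -> per-unit procurement bid
    prices: dict[str, float]    # retailer -> retail price
    slogans: dict[str, str]     # retailer -> marketing slogan
    sales: dict[str, int]       # retailer -> units sold
    balances: dict[str, float]  # retailer -> cash balance after the round

def capital_appreciation(trajectory, retailer, initial_balance):
    """Relative change in capital over the run; > 0 means appreciation,
    values near 0 correspond to the break-even agents the paper reports."""
    final = trajectory[-1].balances[retailer]
    return (final - initial_balance) / initial_balance
```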