MarketBench: Evaluating AI Agents as Market Participants

arXiv cs.AI · April 28, 2026


Key Points

  • The paper proposes MarketBench, a benchmark to evaluate whether AI agents can generate accurate signals about their task success probability and the costs (e.g., token usage) of completing tasks in market-like coordination settings.
  • Using a 93-task subset of SWE-bench Lite and six recently released LLMs, the authors show that the models are miscalibrated on both success likelihood and token consumption (a calibration sketch follows this list).
  • When agents report their own estimates to participate in auctions, the resulting allocations diverge from those expected under full-information assumptions.
  • Injecting capability information from prior experiments into the agents’ context improves calibration only modestly, indicating persistent self-assessment limitations.
  • The study also reports how market-based scaffolding performs with these LLMs, and concludes that self-assessment is a key bottleneck for reliable market-style coordination.
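
The paper's scoring code is not reproduced in this summary, but the calibration claim is easy to make concrete. The sketch below, with entirely hypothetical data and function names, scores an agent's self-reported success probabilities against observed pass/fail outcomes using a Brier score and a binned expected calibration error: a well-calibrated agent that states 70% confidence should pass roughly 70% of those tasks.

```python
import numpy as np

def brier_score(stated: np.ndarray, solved: np.ndarray) -> float:
    """Mean squared gap between stated success probability and the
    binary outcome (0 = task failed, 1 = task passed)."""
    return float(np.mean((stated - solved) ** 2))

def expected_calibration_error(stated: np.ndarray, solved: np.ndarray,
                               n_bins: int = 10) -> float:
    """Occupancy-weighted gap between stated confidence and the
    empirical success rate, over equal-width probability bins."""
    bin_ids = np.minimum((stated * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(stated[mask].mean() - solved[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of tasks
    return float(ece)

# Hypothetical self-reports over 93 tasks: the agent states ~40-90%
# confidence but actually solves ~30% of tasks, i.e. it is overconfident.
rng = np.random.default_rng(0)
stated = rng.uniform(0.4, 0.9, size=93)
solved = (rng.random(93) < 0.3).astype(float)

print(f"Brier score: {brier_score(stated, solved):.3f}")
print(f"ECE:         {expected_calibration_error(stated, solved):.3f}")
```

Token-usage calibration can be scored analogously, e.g., as the relative error between predicted and actual token counts per task.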

Abstract

Markets are a promising way to coordinate AI agent activity, for much the same reasons that justify markets more broadly. To participate effectively in markets, agents need informative signals of their own ability to complete a task successfully and of the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. As a demonstration, we use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention in which we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to the full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.
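
The summary does not spell out the paper's auction mechanism, so the following is only a stylized illustration of why miscalibrated self-reports distort allocations. Assume each agent bids its expected net value for a task (stated success probability times a fixed reward, minus stated cost), each task goes to the highest bidder, and we count how often that winner differs from the one a full-information planner would pick. All numbers and the bid rule here are assumptions, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_tasks, reward = 6, 93, 1.0  # toy sizes mirroring the paper's setup

# True per-(agent, task) success probabilities and costs, plus noisy
# self-reports standing in for miscalibrated estimates: stated success
# probabilities skew high, stated costs skew low.
true_p    = rng.uniform(0.10, 0.60, size=(n_agents, n_tasks))
true_cost = rng.uniform(0.05, 0.30, size=(n_agents, n_tasks))
stated_p    = np.clip(true_p + rng.normal(0.20, 0.10, true_p.shape), 0.0, 1.0)
stated_cost = np.clip(true_cost + rng.normal(-0.05, 0.05, true_cost.shape), 0.0, None)

def allocate(p: np.ndarray, cost: np.ndarray) -> np.ndarray:
    """Give each task to the agent with the highest expected net value."""
    return np.argmax(p * reward - cost, axis=0)

full_info   = allocate(true_p, true_cost)       # what an omniscient planner picks
self_report = allocate(stated_p, stated_cost)   # what the agents' reports induce

divergence = float(np.mean(full_info != self_report))
print(f"Tasks assigned to a different agent than under full information: {divergence:.1%}")
```

Because the report noise hits each agent-task pair differently, some tasks end up with a different winner than under full information even though the allocation rule itself is unchanged; that mismatch rate is the divergence the benchmark probes.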