MarketBench: Evaluating AI Agents as Market Participants

arXiv cs.AI · April 28, 2026


Key Points

  • The paper proposes MarketBench, a benchmark to evaluate whether AI agents can generate accurate signals about their task success probability and the costs (e.g., token usage) of completing tasks in market-like coordination settings.
  • Using a 93-task subset of SWE-bench Lite and six recently released LLMs, the authors show that the models are miscalibrated on both success likelihood and token consumption (a calibration sketch follows this list).
  • When agents report their own estimates to participate in auctions, the resulting allocations diverge from those expected under full-information assumptions.
  • Injecting capability information from prior experiments into the agents’ context improves calibration only modestly, indicating persistent self-assessment limitations.
  • The study also reports how market-based scaffolding performs with these LLMs, and concludes that self-assessment is a key bottleneck for reliable market-style coordination.
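
The paper's scoring code is not reproduced in this summary, but the calibration claim is easy to make concrete. The sketch below, with entirely hypothetical data and function names, scores an agent's self-reported success probabilities against observed pass/fail outcomes using a Brier score and a binned expected calibration error: a well-calibrated agent that states 70% confidence should pass roughly 70% of those tasks.

```python
import numpy as np

def brier_score(stated: np.ndarray, solved: np.ndarray) -> float:
    """Mean squared gap between stated success probability and the
    binary outcome (0 = task failed, 1 = task passed)."""
    return float(np.mean((stated - solved) ** 2))

def expected_calibration_error(stated: np.ndarray, solved: np.ndarray,
                               n_bins: int = 10) -> float:
    """Occupancy-weighted gap between stated confidence and the
    empirical success rate, over equal-width probability bins."""
    bin_ids = np.minimum((stated * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(stated[mask].mean() - solved[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of tasks
    return float(ece)

# Hypothetical self-reports over 93 tasks: the agent states ~40-90%
# confidence but actually solves ~30% of tasks, i.e. it is overconfident.
rng = np.random.default_rng(0)
stated = rng.uniform(0.4, 0.9, size=93)
solved = (rng.random(93) < 0.3).astype(float)

print(f"Brier score: {brier_score(stated, solved):.3f}")
print(f"ECE:         {expected_calibration_error(stated, solved):.3f}")
```

Token-usage calibration can be scored analogously, e.g., as the relative error between predicted and actual token counts per task.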

Abstract

Markets are a promising way to coordinate AI agent activity, for much the same reasons that justify markets more broadly. To participate effectively in markets, agents need informative signals of their own ability to complete a task successfully and of the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. As a demonstration, we use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention in which we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to the full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.
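
The summary does not spell out the paper's auction mechanism, so the following is only a stylized illustration of why miscalibrated self-reports distort allocations. Assume each agent bids its expected net value for a task (stated success probability times a fixed reward, minus stated cost), each task goes to the highest bidder, and we count how often that winner differs from the one a full-information planner would pick. All numbers and the bid rule here are assumptions, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_tasks, reward = 6, 93, 1.0  # toy sizes mirroring the paper's setup

# True per-(agent, task) success probabilities and costs, plus noisy
# self-reports standing in for miscalibrated estimates: stated success
# probabilities skew high, stated costs skew low.
true_p    = rng.uniform(0.10, 0.60, size=(n_agents, n_tasks))
true_cost = rng.uniform(0.05, 0.30, size=(n_agents, n_tasks))
stated_p    = np.clip(true_p + rng.normal(0.20, 0.10, true_p.shape), 0.0, 1.0)
stated_cost = np.clip(true_cost + rng.normal(-0.05, 0.05, true_cost.shape), 0.0, None)

def allocate(p: np.ndarray, cost: np.ndarray) -> np.ndarray:
    """Give each task to the agent with the highest expected net value."""
    return np.argmax(p * reward - cost, axis=0)

full_info   = allocate(true_p, true_cost)       # what an omniscient planner picks
self_report = allocate(stated_p, stated_cost)   # what the agents' reports induce

divergence = float(np.mean(full_info != self_report))
print(f"Tasks assigned to a different agent than under full information: {divergence:.1%}")
```

Because the report noise hits each agent-task pair differently, some tasks end up with a different winner than under full information even though the allocation rule itself is unchanged; that mismatch rate is the divergence the benchmark probes.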