PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

arXiv cs.LG / 4/17/2026



Key Points

PolyBench is introduced as a multimodal benchmark that evaluates LLMs on live prediction-market tasks by pairing time-locked Polymarket snapshots with both Central Limit Order Book (CLOB) states and a real-time news stream.
The benchmark covers 38,666 binary prediction markets across 4,997 events and records synchronized point-in-time cross-sections during Feb 6–12, 2026.
Seven state-of-the-art LLMs (open- and closed-source) were run under identical, timestamp-locked conditions to produce 36,165 predictions.
Results show a major gap between confidence and financial usefulness: only two models delivered positive returns in the simulated order-book execution, with MiMo-V2-Flash (17.6% CWR) and Gemini-3-Flash (6.2% CWR) leading.
The paper claims PolyBench provides a contamination-proof, financially grounded evaluation standard for future research into LLM forecasting and trading under real market uncertainty.
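The timestamp-locked conditions described above can be illustrated with a minimal sketch: a model evaluated at a snapshot should only see news published at or before that snapshot's time. The `NewsItem` type, the `visible_news` helper, and the sample headlines are hypothetical and used only for illustration; they are not taken from the PolyBench code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NewsItem:
    # Hypothetical record for one entry in a news stream.
    published_at: datetime
    headline: str


def visible_news(items, snapshot_time):
    """Return items published at or before snapshot_time, oldest first.

    Anything published later is excluded, so the context handed to a model
    cannot leak information from after the market snapshot.
    """
    return sorted(
        (item for item in items if item.published_at <= snapshot_time),
        key=lambda item: item.published_at,
    )


# Illustrative data only: one snapshot inside the Feb 6-12, 2026 window.
snapshot = datetime(2026, 2, 8, 12, 0, tzinfo=timezone.utc)
stream = [
    NewsItem(datetime(2026, 2, 9, 8, 0, tzinfo=timezone.utc), "after snapshot"),
    NewsItem(datetime(2026, 2, 8, 9, 30, tzinfo=timezone.utc), "before snapshot"),
    NewsItem(datetime(2026, 2, 8, 12, 0, tzinfo=timezone.utc), "exactly at snapshot"),
]
context = visible_news(stream, snapshot)
```

Whether an item stamped exactly at the snapshot time counts as visible is a design choice; this sketch includes it, and a stricter setup could use `<` instead of `<=`.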

Abstract

Predicting real-world events from live market signals demands systems that fuse qualitative news with quantitative order-book dynamics under strict temporal discipline, a challenge existing benchmarks fail to capture. We present PolyBench, a multimodal benchmark derived from Polymarket that records point-in-time cross-sections of 38,666 binary prediction markets spanning 4,997 events, synchronously coupling each snapshot with a Central Limit Order Book (CLOB) state and a real-time news stream. Using PolyBench, we evaluate seven state-of-the-art Large Language Models, spanning open- and closed-source families, generating 36,165 predictions under identical, timestamp-locked market states collected between February 6 and 12, 2026. Our multidimensional framework assesses directional accuracy, our proposed Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio via realistic order-book execution simulation. The results reveal a pronounced performance divergence: only two of seven models achieve positive financial returns (MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR), while the remaining five incur losses despite uniformly high stated confidence. These findings highlight the gap between surface-level language fluency and genuine probabilistic reasoning under live market uncertainty, and establish PolyBench as a contamination-proof, financially grounded evaluation standard for future LLM research. Our dataset and code are available at https://github.com/PolyBench/PolyBench.
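To make "order-book execution simulation" concrete, here is a minimal sketch of the general idea, not the paper's implementation: a buy order walks the ask side of the book from the best price outward, so larger orders fill at worse average prices. The function names, the example book, and the assumptions of no fees and a 1.0 payout per winning binary share are all illustrative. The abstract does not define CWR or APY, so neither is implemented here; the Sharpe function is the plain, non-annualized mean-over-standard-deviation form.

```python
from statistics import mean, stdev


def simulate_buy(asks, budget):
    """Walk ask levels (price, size) from best to worst, spending up to budget.

    Returns (shares_bought, cost_spent). Assumes no fees (illustrative only).
    """
    shares = 0.0
    spent = 0.0
    for price, size in sorted(asks):
        remaining = budget - spent
        if remaining <= 0:
            break
        take = min(size, remaining / price)  # limited by depth or by budget
        shares += take
        spent += take * price
    return shares, spent


def settled_return(shares, spent, won):
    """Return on capital once the binary market resolves.

    Assumes each share pays 1.0 if the chosen side wins and 0.0 otherwise.
    """
    if spent == 0:
        return 0.0
    payoff = shares if won else 0.0
    return (payoff - spent) / spent


def sharpe(returns, risk_free=0.0):
    """Non-annualized Sharpe ratio: mean excess return over its sample std."""
    excess = [r - risk_free for r in returns]
    spread = stdev(excess)
    return mean(excess) / spread if spread > 0 else float("nan")


# Hypothetical book: 100 shares offered at 0.40, then 100 more at 0.45.
book = [(0.45, 100.0), (0.40, 100.0)]
shares, spent = simulate_buy(book, budget=58.0)
avg_price = spent / shares
win_return = settled_return(shares, spent, won=True)
loss_return = settled_return(shares, spent, won=False)
```

In this example the first 40 units of budget clear the 0.40 level and the remaining 18 buy about 40 shares at 0.45, so the average fill price is about 0.414 rather than the quoted best ask of 0.40. That slippage is why returns measured against the order book can differ from returns computed at the displayed price.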