PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
arXiv cs.LG / 4/17/2026
Key Points
- PolyBench is introduced as a multimodal benchmark that evaluates LLMs on live prediction-market tasks by pairing time-locked Polymarket snapshots with both Central Limit Order Book (CLOB) states and a real-time news stream.
- The benchmark covers 38,666 binary prediction markets across 4,997 events and records synchronized point-in-time cross-sections during Feb 6–12, 2026.
- Seven state-of-the-art LLMs (open- and closed-source) were run under identical, timestamp-locked conditions to produce 36,165 predictions.
- Results show a wide gap between model confidence and financial usefulness: only two of the seven models earned positive returns under simulated order-book execution, with MiMo-V2-Flash (17.6% CWR) and Gemini-3-Flash (6.2% CWR) leading.
- The paper claims PolyBench provides a contamination-proof, financially grounded evaluation standard for future research into LLM forecasting and trading under real market uncertainty.
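
To make "simulated order-book execution" concrete, here is a minimal sketch of how a market buy might be filled against one side of a CLOB snapshot and then scored once a binary market resolves. It is an illustration under stated assumptions, not the paper's actual simulator: the names `Level`, `fill_market_buy`, and `realized_return` are hypothetical, fees and latency are ignored, and each winning share is assumed to pay out $1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Level:
    """One resting ask level in a hypothetical CLOB snapshot."""
    price: float  # dollars per share, expected to lie in (0, 1]
    size: float   # shares available at this price


def fill_market_buy(asks: List[Level], budget: float) -> Tuple[float, float]:
    """Walk the ask side from cheapest to most expensive, spending at most `budget`.

    Returns (shares_bought, cash_spent). Assumption: no fees, no partial-fill
    restrictions, and the snapshot does not change while the order executes.
    """
    shares = 0.0
    spent = 0.0
    for level in sorted(asks, key=lambda lvl: lvl.price):
        remaining = budget - spent
        if remaining <= 0:
            break
        if level.price <= 0:
            continue  # skip malformed levels rather than divide by zero
        take = min(level.size, remaining / level.price)
        shares += take
        spent += take * level.price
    return shares, spent


def realized_return(shares: float, spent: float, resolved_yes: bool) -> float:
    """Return on capital for a YES position, assuming a $1 payout per winning share."""
    if spent == 0:
        return 0.0
    payout = shares if resolved_yes else 0.0
    return (payout - spent) / spent


if __name__ == "__main__":
    book = [Level(price=0.42, size=100), Level(price=0.40, size=50)]
    shares, spent = fill_market_buy(book, budget=41.0)
    print(shares, spent, realized_return(shares, spent, resolved_yes=True))
```

Because the fill walks up the book, the average price paid (here about $0.41) is worse than the best ask ($0.40). This is one way a confident forecast can still fail to translate into positive returns once execution is simulated.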