QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

arXiv cs.LG / 4/20/2026

📰 News / Models & Research

Key Points

  • The paper argues that current LLM uncertainty benchmarks are limited to judgment-style tasks in simple formats (e.g., binary or multiple-choice questions) and don’t capture real forecasting needs involving continuous numerical quantities.
  • It proposes prediction intervals as a rigorous evaluation interface because they require models to express uncertainty, maintain internal consistency across confidence levels, and achieve calibration over a continuum of outcomes.
  • The authors introduce QuantSightBench, a new benchmark for LLM quantitative forecasting, and evaluate 11 frontier and open-weight models using metrics such as empirical coverage and interval sharpness (see the sketch after this list).
  • Results show no evaluated model reaches the 90% coverage target, with leading models (Gemini 3.1 Pro at 79.1%, Grok 4 at 76.4%, GPT-5.4 at 75.3%) missing by at least 10 percentage points.
  • The study finds calibration worsens sharply at extreme magnitudes, indicating systematic overconfidence across the evaluated models.
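The paper's exact metric definitions are not reproduced here, but the two quantities named above are commonly computed as follows: empirical coverage is the fraction of realized outcomes that fall inside the model's predicted intervals, and sharpness is typically summarized by interval width. The sketch below is an illustration under those assumptions; the function and variable names are hypothetical, not from the benchmark.

```python
import numpy as np

def coverage_and_sharpness(lower, upper, y_true):
    """Illustrative prediction-interval metrics (not the paper's exact code).

    lower, upper : per-question interval bounds predicted by a model
    y_true       : realized outcomes

    Returns (empirical coverage, mean interval width). Using mean width
    as the sharpness summary is an assumption for illustration.
    """
    lower, upper, y_true = map(np.asarray, (lower, upper, y_true))
    covered = (y_true >= lower) & (y_true <= upper)   # hit/miss per question
    coverage = covered.mean()                         # compare to the nominal 90% target
    sharpness = (upper - lower).mean()                # narrower intervals are sharper
    return coverage, sharpness

# Example: three nominal-90% intervals, one of which misses the realized value.
cov, sharp = coverage_and_sharpness([1.0, 4.0, 10.0], [3.0, 8.0, 20.0], [2.5, 9.0, 15.0])
print(f"coverage={cov:.2f}, mean width={sharp:.1f}")  # coverage=0.67, mean width=5.3
```

A well-calibrated model should land near the nominal level (e.g., about 90% coverage for 90% intervals) while keeping intervals as narrow as possible; the reported results indicate every evaluated model falls short of that target.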

Abstract

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark, QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.