SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

arXiv cs.AI / 4/14/2026



Key Points

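The paper proposes SciPredict and tests how accurately LLMs can predict experimental outcomes in advance, using 405 tasks drawn from 33 sub-fields of physics, biology, and chemistry.
In the evaluation, model accuracy stays at 14-26%, which the authors report is below the level needed for reliable experimental guidance.
Models also struggle to tell reliable predictions from unreliable ones: accuracy stays at about 20% regardless of the model's confidence or whether it judges the outcome "predictable without an experiment".
Human experts, by contrast, are well calibrated: their accuracy rises from about 5% to about 80% when they judge an outcome predictable, which underscores the importance of recognizing reliability.
The data and code are public, and the paper frames the integration of prediction into the experimental process as requiring not only correct predictions but also awareness of how reliable each prediction is.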

Abstract

Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is ≈20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only ≈20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict
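
The calibration comparison in the abstract can be read as accuracy conditioned on how predictable the predictor judged each outcome to be. The following is a minimal sketch of that bookkeeping, not the SciPredict evaluation code. The function names, the "low"/"high" bucket labels, and the toy records are all hypothetical and invented for illustration; they are not data from the paper.

```python
from collections import defaultdict


def accuracy_by_bucket(records):
    """Return accuracy per bucket from (bucket, is_correct) pairs.

    `bucket` is a hypothetical label for how predictable the predictor
    judged the outcome to be (for example "low" or "high").
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for bucket, is_correct in records:
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def calibration_gap(per_bucket, low, high):
    """Accuracy difference between the high and low predictability buckets.

    A gap near zero means the predictor's own judgment carries little
    information about whether its prediction is correct.
    """
    return per_bucket[high] - per_bucket[low]


# Made-up toy records, NOT results from the paper.
toy_records = [
    ("low", False), ("low", False), ("low", False), ("low", True),
    ("high", True), ("high", True), ("high", True), ("high", False),
]

per_bucket = accuracy_by_bucket(toy_records)
gap = calibration_gap(per_bucket, "low", "high")
print(per_bucket)
print(gap)
```

Under this reading, a predictor whose accuracy is flat across buckets would show a gap near zero, while one whose accuracy rises with judged predictability would show a large positive gap.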


