A Judge Agent Closes the Reliability Gap in AI-Generated Scientific Simulation

arXiv cs.LG, March 30, 2026


Key Points

  • The study addresses the high silent-failure rate of LLM-generated scientific simulation code by introducing a “Judge Agent” that automates classical mathematical validation: well-posedness checks, convergence testing, and error certification.
  • Across 134 test cases in 12 scientific domains, the silent-failure rate drops from 42% to 1.5%, with residual errors concentrated around bifurcation points where certifiability is harder.
  • A prospective benchmark using 72 blinded tasks submitted by 12 independent scientists reports an 89% success rate (95% CI: [80%, 95%]) with automated error bounds, compared with 53% without the Judge.
  • On a clinical CT experiment (the only powered study, n=200), the pipeline reaches 99% of expert-quality performance, suggesting strong reliability for real-world simulation workloads.
  • The authors formalize certifiability limits through a “simulability class S” framework and propose spec.md, a structured, machine-readable specification format, while publicly archiving code, data, and the full benchmark suite.
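The paper does not reproduce a spec.md instance in this summary, so the following is a purely illustrative sketch of what a structured, machine-readable, solver-independent problem specification might look like; the section names and fields here are assumptions, not the authors' actual schema:

```markdown
# spec.md — 1-D heat equation (illustrative sketch; schema assumed, not from the paper)

## Problem
- PDE: u_t = alpha * u_xx on x in [0, 1], t in [0, T]
- Parameters: alpha = 0.01, T = 1.0

## Boundary and initial conditions
- Dirichlet: u(0, t) = 0, u(1, t) = 0
- Initial: u(x, 0) = sin(pi * x)

## Certification criteria (what a Judge could check automatically)
- Well-posedness: alpha > 0 (parabolic problem, well-posed)
- Convergence: observed order within 0.2 of the scheme's theoretical order
- Error bound: L2 error <= 1e-4 against a manufactured solution
```

The point of such a format is that the problem statement and its acceptance criteria are declared independently of any particular solver, so validation can be automated regardless of what code the LLM generates.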

Abstract

Large language models can generate scientific simulation code, but the generated code silently fails on most non-textbook problems. We show that classical mathematical validation -- well-posedness, convergence, and error certification -- can be fully automated by a Judge Agent, reducing the silent-failure rate from 42% to 1.5% across 134 test cases spanning 12 scientific domains. The headline result comes from a prospective benchmark: 72 blinded tasks submitted by 12 independent scientists yield an 89% success rate (95% CI: [80%, 95%]) with automated error bounds, versus 53% without the Judge. On clinical CT (the only powered experiment, n = 200), the pipeline reaches 99% of expert quality. The residual 1.5% concentrates at bifurcation points where certifiability breaks down. We formalize this boundary through the simulability class S and introduce spec.md, a structured specification format that makes any scientific computation problem machine-readable and solver-independent. Code, data, and all 72 benchmark tasks are publicly archived.
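The paper's Judge Agent implementation is not shown in this summary, but the kind of classical convergence certification it automates is standard: run the generated solver on successively refined grids, estimate the observed order of accuracy from the error ratios, and refuse to certify if it disagrees with the scheme's theoretical order. A minimal sketch of that check (function names and the 0.2 tolerance are assumptions for illustration, not the authors' values):

```python
import math

def observed_order(errors, refinement_ratio=2.0):
    """Estimate the convergence order p between successive grid refinements,
    assuming the error model e(h) ~ C * h**p."""
    return [math.log(e_coarse / e_fine) / math.log(refinement_ratio)
            for e_coarse, e_fine in zip(errors, errors[1:])]

def certify_convergence(errors, expected_order, tol=0.2):
    """Certify only if every observed order matches the theoretical order
    to within tol; otherwise the run is flagged as a (possibly silent) failure."""
    orders = observed_order(errors)
    certified = all(abs(p - expected_order) <= tol for p in orders)
    return certified, orders

# Errors from a hypothetical second-order scheme on grids h, h/2, h/4,
# e.g. measured against a manufactured solution:
errors = [4.0e-2, 1.0e-2, 2.5e-3]
ok, orders = certify_convergence(errors, expected_order=2.0)
# ok is True here because the observed orders are both 2.0
```

A check like this catches the "silent failure" mode the paper targets: code that runs and produces plausible-looking numbers but does not actually converge at the rate its discretization promises.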