A Judge Agent Closes the Reliability Gap in AI-Generated Scientific Simulation

arXiv cs.LG, March 30, 2026


Key Points

  • The study addresses the high silent-failure rate of LLM-generated scientific simulation code by introducing a “Judge Agent” that automates classical mathematical validation: well-posedness checks, convergence testing, and error certification.
  • Across 134 test cases in 12 scientific domains, the silent-failure rate drops from 42% to 1.5%, with residual errors concentrated around bifurcation points where certifiability is harder.
  • A prospective benchmark using 72 blinded tasks submitted by 12 independent scientists reports an 89% success rate (95% CI: [80%, 95%]) with automated error bounds, compared with 53% without the Judge.
  • On a clinical CT experiment (the only powered study, n=200), the pipeline reaches 99% of expert-quality performance, suggesting strong reliability for real-world simulation workloads.
  • The authors formalize certifiability limits through a “simulability class S” framework and propose spec.md, a structured, machine-readable specification format, while publicly archiving code, data, and the full benchmark suite.
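The paper does not reproduce a spec.md instance in this summary, so the following is a purely illustrative sketch of what a structured, machine-readable, solver-independent problem specification might look like; the section names and fields here are assumptions, not the authors' actual schema:

```markdown
# spec.md — 1-D heat equation (illustrative sketch; schema assumed, not from the paper)

## Problem
- PDE: u_t = alpha * u_xx on x in [0, 1], t in [0, T]
- Parameters: alpha = 0.01, T = 1.0

## Boundary and initial conditions
- Dirichlet: u(0, t) = 0, u(1, t) = 0
- Initial: u(x, 0) = sin(pi * x)

## Certification criteria (what a Judge could check automatically)
- Well-posedness: alpha > 0 (parabolic problem, well-posed)
- Convergence: observed order within 0.2 of the scheme's theoretical order
- Error bound: L2 error <= 1e-4 against a manufactured solution
```

The point of such a format is that the problem statement and its acceptance criteria are declared independently of any particular solver, so validation can be automated regardless of what code the LLM generates.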

Abstract

Large language models can generate scientific simulation code, but the generated code silently fails on most non-textbook problems. We show that classical mathematical validation -- well-posedness, convergence, and error certification -- can be fully automated by a Judge Agent, reducing the silent-failure rate from 42% to 1.5% across 134 test cases spanning 12 scientific domains. The headline result comes from a prospective benchmark: 72 blinded tasks submitted by 12 independent scientists yield an 89% success rate (95% CI: [80%, 95%]) with automated error bounds, versus 53% without the Judge. On clinical CT (the only powered experiment, n = 200), the pipeline reaches 99% of expert quality. The residual 1.5% concentrates at bifurcation points where certifiability breaks down. We formalize this boundary through the simulability class S and introduce spec.md, a structured specification format that makes any scientific computation problem machine-readable and solver-independent. Code, data, and all 72 benchmark tasks are publicly archived.
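The paper's Judge Agent implementation is not shown in this summary, but the kind of classical convergence certification it automates is standard: run the generated solver on successively refined grids, estimate the observed order of accuracy from the error ratios, and refuse to certify if it disagrees with the scheme's theoretical order. A minimal sketch of that check (function names and the 0.2 tolerance are assumptions for illustration, not the authors' values):

```python
import math

def observed_order(errors, refinement_ratio=2.0):
    """Estimate the convergence order p between successive grid refinements,
    assuming the error model e(h) ~ C * h**p."""
    return [math.log(e_coarse / e_fine) / math.log(refinement_ratio)
            for e_coarse, e_fine in zip(errors, errors[1:])]

def certify_convergence(errors, expected_order, tol=0.2):
    """Certify only if every observed order matches the theoretical order
    to within tol; otherwise the run is flagged as a (possibly silent) failure."""
    orders = observed_order(errors)
    certified = all(abs(p - expected_order) <= tol for p in orders)
    return certified, orders

# Errors from a hypothetical second-order scheme on grids h, h/2, h/4,
# e.g. measured against a manufactured solution:
errors = [4.0e-2, 1.0e-2, 2.5e-3]
ok, orders = certify_convergence(errors, expected_order=2.0)
# ok is True here because the observed orders are both 2.0
```

A check like this catches the "silent failure" mode the paper targets: code that runs and produces plausible-looking numbers but does not actually converge at the rate its discretization promises.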