SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

arXiv cs.AI / 3/31/2026


Key Points

  • The paper argues that evaluating LLM agents for scientific tasks should account not only for token costs but also for tool-use costs such as simulation time and experimental resources, since common metrics like pass@k fail under realistic budgets.
  • It introduces SimulCost, a cost-aware benchmark and open-source toolkit for physics simulations, covering 2,916 single-round initial-guess tasks and 1,900 multi-round trial-and-error adjustment tasks across 12 simulators in fluid dynamics, solid mechanics, and plasma physics.
  • The study uses analytically defined, platform-independent cost models per simulator to compare LLM-driven cost-sensitive parameter tuning against traditional scanning in both accuracy and computational expense.
  • Results show frontier LLMs achieve 46–64% success in single-round mode (dropping to 35–54% for high accuracy), while multi-round improves to 71–80% but is 1.5–2.5× slower than scanning, making LLM approaches potentially uneconomical despite accuracy gains.
  • The authors further analyze parameter group correlations for knowledge transfer, and evaluate how in-context examples and reasoning effort affect performance, publishing code and data to enable extensions to new simulation environments.
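The contrast the paper draws between pass@k and budget-constrained evaluation can be made concrete with a small sketch. Below, `pass_at_k` is the standard unbiased pass@k estimator, while `success_within_budget` is a hypothetical cost-aware success criterion (the function name and `(cost, succeeded)` interface are illustrative assumptions, not the SimulCost API): an agent only counts as successful if some attempt passes before cumulative simulation cost exhausts the budget.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled attempts succeeds, given c successes among n attempts.
    Note that cost plays no role here."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def success_within_budget(attempts, budget: float) -> bool:
    """Hypothetical cost-aware metric: `attempts` is a list of
    (cost, succeeded) pairs in the order they were run. Success only
    counts if it happens before cumulative cost exceeds the budget."""
    spent = 0.0
    for cost, succeeded in attempts:
        spent += cost
        if spent > budget:
            return False  # budget exhausted before a success
        if succeeded:
            return True
    return False


# pass@k looks respectable for 3 successes out of 10 attempts at k=5 ...
print(pass_at_k(10, 3, 5))
# ... but under a budget of 100 cost units, two expensive failed runs
# exhaust the budget before the third (successful) run completes:
print(success_within_budget([(40.0, False), (40.0, False), (40.0, True)], 100.0))
```

The point of the sketch is that the two metrics can disagree: a run history that scores well on pass@k can still fail outright once simulation cost is charged against a finite budget.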

Abstract

Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations. SimulCost compares LLM-driven tuning of cost-sensitive parameters against traditional scanning approaches in both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (adjustment by trial-and-error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46--64% success rates in single-round mode, dropping to 35--54% under high accuracy requirements, rendering their initial guesses unreliable, especially for high-accuracy tasks. Multi-round mode improves rates to 71--80%, but LLMs are 1.5--2.5x slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in-context examples and reasoning effort, providing practical implications for deployment and fine-tuning. We open-source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost-aware agentic designs for physics simulations and on extending it to new simulation environments. Code and data are available at https://github.com/Rose-STL-Lab/SimulCost-Bench.
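The two evaluation modes and the scanning baseline can be illustrated with a toy setup. Everything below is a hypothetical sketch, not the SimulCost toolkit: `simulate` is a stand-in simulator with an analytic cost model (cheaper runs at coarser parameter values), `scan` is a fixed-grid baseline, and `trial_and_error` stands in for a multi-round agent that adjusts its guess from signed-error feedback, paying each round's simulation cost.

```python
def simulate(x: float) -> tuple[float, float]:
    """Toy simulator for one tunable parameter x. Returns
    (signed_error, cost); cost grows as x shrinks, mimicking
    finer-resolution runs being more expensive. The target value
    0.3 is an arbitrary choice for this sketch."""
    error = x - 0.3
    cost = 1.0 / max(x, 1e-6)
    return error, cost


def scan(candidates, tol: float):
    """Traditional scanning baseline: sweep a fixed grid,
    cheapest (largest-x) runs first, until one is within tolerance.
    Returns (accepted_x_or_None, total_cost)."""
    total = 0.0
    for x in sorted(candidates, reverse=True):
        err, cost = simulate(x)
        total += cost
        if abs(err) <= tol:
            return x, total
    return None, total


def trial_and_error(x0: float, tol: float, max_rounds: int = 10):
    """Multi-round adjustment: a bisection-style tuner standing in
    for an LLM proposing a new parameter each round from feedback.
    Each proposal costs a full simulation run."""
    lo, hi = 0.05, 1.0
    x, total = x0, 0.0
    for _ in range(max_rounds):
        err, cost = simulate(x)
        total += cost
        if abs(err) <= tol:
            return x, total
        if err < 0:
            lo = x  # undershot: move the lower bracket up
        else:
            hi = x  # overshot: move the upper bracket down
        x = (lo + hi) / 2
    return None, total


grid_x, grid_cost = scan([i / 10 for i in range(1, 11)], tol=0.01)
agent_x, agent_cost = trial_and_error(0.8, tol=0.01)
print(grid_x, grid_cost)
print(agent_x, agent_cost)
```

Comparing `grid_cost` and `agent_cost` mirrors the paper's framing: both strategies can reach the tolerance, but the interesting question is which one spends less simulated cost doing so, and under which cost model each wins.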