SciDesignBench: Benchmarking and Improving Language Models for Scientific Inverse Design
arXiv cs.LG / 3/16/2026
Key Points
- SciDesignBench is introduced as a benchmark of 520 simulator-grounded tasks across 14 scientific domains and five settings, evaluating inverse design: mapping desired outcomes back to the design inputs that produce them.
- On the 10-domain shared-core subset, the best zero-shot model achieves only 29.0% success despite high parse rates; simulator feedback shifts performance, and the leaderboard depends on interaction horizon (e.g., Sonnet 4.5 leads one-turn de novo design, while Opus 4.6 leads after 20 turns).
- Providing a starting seed design reshuffles the leaderboard, showing that constrained modification demands capabilities distinct from unconstrained de novo generation.
- A simulator-feedback training recipe, RLSF, is proposed; an 8B model tuned with RLSF improves single-turn success by 8–17 percentage points across three domains, suggesting that test-time compute can be amortized into model weights and establishing simulator-grounded inverse design as both a scientific benchmark and a practical tool.
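The multi-turn setting described above can be sketched as a simple loop: a model proposes a design, a simulator evaluates it, and the feedback informs the next proposal until the target outcome is reached or the turn budget runs out. A minimal toy sketch, assuming nothing about SciDesignBench's actual API: `simulate`, `propose_design`, and the success criterion are all illustrative placeholders (here a squaring "simulator" and a multiplicative-correction "model" stand in for a real physics simulator and an LM).

```python
# Toy sketch of a multi-turn inverse-design loop with simulator feedback.
# All names here (simulate, propose_design, inverse_design) are
# hypothetical illustrations, not the SciDesignBench interface.

def simulate(design):
    # Stand-in "simulator": maps a scalar design parameter to an outcome.
    return design ** 2

def propose_design(target, history):
    # Stand-in "model": applies a multiplicative correction based on the
    # last simulator feedback; a real LM would condition on the full
    # interaction history instead.
    if not history:
        return 1.0  # initial de novo guess
    last_design, last_outcome = history[-1]
    return last_design * (target / last_outcome) ** 0.5

def inverse_design(target, max_turns=20, tol=1e-6):
    """Run up to max_turns propose/simulate rounds toward `target`."""
    history = []
    for turn in range(max_turns):
        design = propose_design(target, history)
        outcome = simulate(design)           # simulator feedback
        history.append((design, outcome))
        if abs(outcome - target) <= tol:     # success criterion
            return design, turn + 1
    return None, max_turns

design, turns = inverse_design(9.0)
print(design, turns)  # converges to design 3.0 within a couple of turns
```

The horizon dependence reported in the key points corresponds to varying `max_turns`: at `max_turns=1` only the initial proposal counts, while longer budgets reward models that exploit feedback effectively.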