SciDesignBench: Benchmarking and Improving Language Models for Scientific Inverse Design
arXiv cs.LG / 3/16/2026
Key Points
- SciDesignBench is introduced as a benchmark of 520 simulator-grounded tasks spanning 14 scientific domains and five settings, evaluating inverse design: mapping desired outcomes back to the design inputs that produce them.
- On the 10-domain shared-core subset, the best zero-shot model reaches only 29.0% success (parse rates run higher), and simulator feedback shifts performance: the leaderboard depends on the interaction horizon, with Sonnet 4.5 leading one-turn de novo design and Opus 4.6 leading after 20 turns (the loop is sketched after this list).
- Providing a starting seed design reshuffles the leaderboard, showing that constrained modification calls on capabilities distinct from unconstrained de novo generation.
- A simulator-feedback training recipe called RLSF is proposed: an 8B model tuned with it gains 8–17 percentage points of single-turn success across three domains (a reward sketch follows below), suggesting that test-time compute can be amortized into model weights and positioning simulator-grounded inverse design as both a scientific benchmark and a practical tool.
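
The multi-turn, simulator-in-the-loop protocol the benchmark describes can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the paper's harness: `Task`, `query_model`, `parse_design`, and `simulate` are hypothetical stand-ins for the benchmark's actual interfaces, and the scalar target/tolerance success check is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    prompt: str       # desired outcome, constraints, output format spec
    domain: str       # which scientific simulator to call
    target: float     # desired simulator outcome (scalar for simplicity)
    tolerance: float  # how close the outcome must be to count as success

def evaluate_task(
    task: Task,
    query_model: Callable[[list[str]], str],        # LLM call (assumed interface)
    parse_design: Callable[[str], Optional[dict]],  # reply -> design, or None
    simulate: Callable[[str, dict], float],         # ground-truth simulator
    max_turns: int = 20,
) -> dict:
    """Run one inverse-design task: the model proposes design inputs,
    the simulator scores them, and the feedback is appended to the
    conversation so the next turn can refine the design."""
    history = [task.prompt]
    for turn in range(1, max_turns + 1):
        reply = query_model(history)
        history.append(reply)
        design = parse_design(reply)
        if design is None:  # unparseable replies still consume a turn
            history.append("Parse error: reply did not match the schema.")
            continue
        outcome = simulate(task.domain, design)
        if abs(outcome - task.target) <= task.tolerance:
            return {"success": True, "turns": turn}
        history.append(f"Simulator feedback: outcome was {outcome:.4g}.")
    return {"success": False, "turns": max_turns}
```

A one-turn de novo setting corresponds to `max_turns=1` with no seed design in the prompt; the 20-turn leaderboard corresponds to the default, which is why models that refine well under feedback can overtake models that are stronger on the first shot.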
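The summary does not spell out the RLSF recipe, so the following is only a hedged sketch of what a simulator-grounded reward could look like when fine-tuning with a policy-gradient method; the parse penalty and the distance-based shaping are assumptions for illustration, not the published method. It reuses `Task`, `parse_design`, and `simulate` from the sketch above.

```python
def rlsf_reward(
    task: Task,
    model_reply: str,
    parse_design: Callable[[str], Optional[dict]],
    simulate: Callable[[str, dict], float],
) -> float:
    """Score one model reply with the simulator; usable as the scalar
    reward in a policy-gradient fine-tuning loop (e.g., PPO-style)."""
    design = parse_design(model_reply)
    if design is None:
        return -1.0  # discourage outputs that do not parse
    error = abs(simulate(task.domain, design) - task.target)
    if error <= task.tolerance:
        return 1.0   # full reward: the design meets the spec
    # Shaped partial credit: decays toward 0 as the design drifts
    # further from the target (the 10x scale is an arbitrary choice).
    return max(0.0, 1.0 - error / (10.0 * task.tolerance))
```

Folding simulator feedback into the reward at training time is what lets the tuned model succeed in a single turn, amortizing the multi-turn test-time refinement loop into the weights.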