MolViBench: Evaluating LLMs on Molecular Vibe Coding

arXiv cs.CL / 5/5/2026

Key Points

  • Molecular Vibe Coding is described as a workflow paradigm where chemists work with LLMs to generate executable programs for molecular tasks, offering flexibility beyond tool-constrained chemical agents.
  • The article argues that existing benchmarks are insufficient because general coding datasets lack chemistry reasoning, while chemistry benchmarks typically focus on recall or property prediction rather than executable code generation.
  • It introduces MolViBench, the first benchmark specifically designed for Molecular Vibe Coding, featuring 358 curated tasks across five cognitive levels and 12 real-world drug discovery workflows.
  • A multi-layered evaluation framework is proposed that combines type-aware output comparison with AST-based API-semantic fallback analysis to jointly measure the executability and chemical correctness of generated code.
  • The benchmark is used to evaluate nine leading coding LLMs and to compare three real-world Molecular Vibe Coding paradigms, aiming to diagnose model strengths and weaknesses for AI-accelerated molecular discovery.
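To make the idea of type-aware output comparison concrete, here is a minimal sketch of how a generated program's output might be checked against a reference answer. The summary does not describe MolViBench's actual comparison rules, so the function name `outputs_match`, the tolerances, and the per-type rules below are all illustrative assumptions.

```python
import math


def outputs_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two outputs with a rule chosen by their types.

    Hypothetical helper: the rules and tolerances are assumptions,
    not the benchmark's published behaviour.
    """
    # bool is a subclass of int, so handle it before the numeric branch.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return (
            isinstance(expected, bool)
            and isinstance(actual, bool)
            and expected == actual
        )
    # Numbers: tolerate small floating-point differences.
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
    # Strings: ignore surrounding whitespace only.
    if isinstance(expected, str) and isinstance(actual, str):
        return expected.strip() == actual.strip()
    # Ordered sequences: compare element by element, recursively.
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return len(expected) == len(actual) and all(
            outputs_match(e, a, rel_tol, abs_tol)
            for e, a in zip(expected, actual)
        )
    # Mappings: same keys, recursively matching values.
    if isinstance(expected, dict) and isinstance(actual, dict):
        return expected.keys() == actual.keys() and all(
            outputs_match(expected[k], actual[k], rel_tol, abs_tol)
            for k in expected
        )
    # Anything else: require the same type and plain equality.
    return type(expected) is type(actual) and expected == actual
```

Under these assumed rules, `outputs_match(180.16, 180.1600000001)` passes while `outputs_match(True, 1)` does not. Note that the string branch compares text literally, so two different SMILES strings for the same molecule would not match; a real chemistry-aware checker would presumably need canonicalization on top of this.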

Abstract

Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding poses a distinctive challenge: LLMs must jointly possess programming, molecular understanding, and domain-specific reasoning capabilities. However, existing benchmarks remain disconnected. General code generation benchmarks such as HumanEval and SWE-bench require no chemistry knowledge, while chemistry-focused benchmarks such as S^2-Bench and ChemCoTBench evaluate knowledge recall or property prediction rather than executable code generation. To bridge this gap, we introduce MolViBench, the first benchmark tailored for Molecular Vibe Coding. MolViBench comprises 358 curated tasks across five cognitive levels, ranging from single-API recall to end-to-end virtual screening pipeline design, and spanning 12 real-world drug discovery workflows. To rigorously assess generated code, we also propose a multi-layered evaluation framework that combines type-aware output comparison and AST-based API-semantic fallback analysis, which jointly measure executability and chemical correctness. We systematically evaluate 9 frontier coding LLMs and compare three real-world Molecular Vibe Coding paradigms, providing a practical and fine-grained testbed for diagnosing LLMs' coding capabilities in AI-accelerated molecular discovery.
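As a rough illustration of what an AST-based API analysis could look like, the sketch below parses generated code without running it and checks which expected API calls appear. The abstract does not specify how MolViBench's fallback analysis works, so `called_api_names`, `api_coverage`, and the suffix-matching rule are hypothetical. The sample program uses RDKit call names (`Chem.MolFromSmiles`, `Descriptors.MolWt`) purely as an assumed example; it is only parsed, never executed, so RDKit does not need to be installed.

```python
import ast

# Assumed example of LLM-generated code; parsed as text, never executed.
SAMPLE = """
from rdkit import Chem
from rdkit.Chem import Descriptors
mol = Chem.MolFromSmiles("CCO")
print(Descriptors.MolWt(mol))
"""


def called_api_names(source):
    """Return the dotted names used as call targets in `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Unwind attribute access, e.g. Chem.MolFromSmiles -> ["MolFromSmiles", "Chem"].
        parts = []
        target = node.func
        while isinstance(target, ast.Attribute):
            parts.append(target.attr)
            target = target.value
        # Keep only targets rooted at a plain name; chained calls such as
        # f().g() are skipped in this simplified sketch.
        if isinstance(target, ast.Name):
            parts.append(target.id)
            names.add(".".join(reversed(parts)))
    return names


def api_coverage(source, required_calls):
    """Fraction of `required_calls` found among the call targets in `source`.

    A required name matches a call if it equals the dotted name or its
    trailing component(s). This matching rule is an assumption.
    """
    if not required_calls:
        return 1.0
    found = called_api_names(source)
    hits = sum(
        1
        for required in required_calls
        if any(name == required or name.endswith("." + required) for name in found)
    )
    return hits / len(required_calls)
```

For example, `api_coverage(SAMPLE, ["MolFromSmiles", "MolWt", "MolLogP"])` yields two hits out of three, since the sample never calls `MolLogP`. A static check of this kind can still give partial credit when code fails to execute, which is one plausible reading of "fallback" in the abstract.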