LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

arXiv cs.AI / 4/14/2026


Key Points

  • LABBench2 is an evolution of LAB-Bench, designed to measure the real-world ability of AI systems to perform "meaningful work" in biology research.
  • LABBench2 comprises nearly 1,900 tasks and is largely a continuation of LAB-Bench, extended to evaluate the same kinds of capabilities in more realistic contexts.
  • Evaluating frontier models shows that while performance on the capabilities measured by LAB-Bench/LABBench2 has improved substantially, LABBench2 meaningfully raises the difficulty, with model-specific accuracy differences ranging from -26% to -46% across subtasks.
  • To encourage use by the research community, the task dataset (Hugging Face) and a public evaluation harness (GitHub) are provided.

Abstract

Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data, to agentic autonomous hypothesis generation systems, to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains must correspondingly not only accelerate, but increasingly shift focus to more real-world capabilities: beyond rote knowledge, and even reasoning alone, to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark (LAB-Bench) as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while the abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities, and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.
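
As a quick illustration, the task dataset can likely be pulled with the Hugging Face `datasets` library. The snippet below is a minimal sketch only: the dataset ID comes from the link above, but the config name, split names, and field schema are assumptions and should be checked against the dataset card and the official eval harness before use.

```python
# Minimal sketch: browsing the LABBench2 tasks from Hugging Face.
# Assumptions (not confirmed by the paper): the dataset loads without an
# explicit config name, and each task record is a flat dict of fields.
from datasets import load_dataset

ds = load_dataset("futurehouse/labbench2")   # dataset ID taken from the paper's link
print(ds)                                    # list the available splits and their sizes

first_split = next(iter(ds.values()))        # grab whichever split is present
print(first_split[0])                        # inspect the fields of a single task
```

For actual scoring, the public eval harness at the GitHub link above is the intended entry point; loading the raw dataset like this is mainly useful for inspecting task formats.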