LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories

arXiv cs.AI / 3/13/2026

📰 News · Models & Research

Key Points

  • LABSHIELD introduces a realistic multimodal benchmark to assess MLLMs in hazard identification and safety-critical reasoning within scientific laboratories, grounded in OSHA standards and GHS classifications.
  • It spans 164 operational tasks with diverse manipulation complexities and risk profiles to enable rigorous safety evaluation across lab scenarios.
  • The evaluation covers 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track framework, exposing a systematic gap between general-domain multiple-choice (MCQ) accuracy and safety-oriented semi-open QA performance.
  • Models drop by 32.0% on average in professional-lab safety performance, with the sharpest declines in hazard interpretation and safety-aware planning, underscoring the need for safety-centric reasoning in embodied lab AI (a sketch of this gap computation follows the list).
  • The full dataset will be released soon, signaling a new resource for advancing safe autonomous experimentation in labs.

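To make the dual-track gap concrete, here is a minimal sketch of how an average MCQ-to-QA drop like the reported 32.0% could be computed, assuming per-model scores on both tracks are available. The model names and scores are illustrative placeholders, not figures from the paper, and since the summary does not specify whether the drop is absolute or relative, the sketch computes both.

```python
# Illustrative sketch only: placeholder scores, not results from LABSHIELD.
# Each entry maps a model to (general-domain MCQ accuracy %, semi-open QA score %).
scores = {
    "model_a": (82.0, 51.0),
    "model_b": (76.0, 47.0),
    "model_c": (69.0, 45.0),
}

# Absolute drop: difference in percentage points between the two tracks.
abs_drops = [mcq - qa for mcq, qa in scores.values()]

# Relative drop: fraction of the MCQ score lost on the safety track.
rel_drops = [(mcq - qa) / mcq * 100 for mcq, qa in scores.values()]

print(f"mean absolute drop: {sum(abs_drops) / len(abs_drops):.1f} points")
print(f"mean relative drop: {sum(rel_drops) / len(rel_drops):.1f}%")
```
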
Abstract

Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and Semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent necessity for safety-centric reasoning frameworks to ensure reliable autonomous scientific experimentation in embodied laboratory contexts. The full dataset will be released soon.
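
To give a feel for what a safety taxonomy grounded in OSHA standards and GHS classifications might look like in data form, below is a minimal, purely hypothetical sketch of a LABSHIELD-style task record. All field names and example values are assumptions for illustration; the actual schema has not been released.

```python
from dataclasses import dataclass

# Hypothetical schema: field names are assumptions, not the published format.
@dataclass
class LabTask:
    task_id: str
    instruction: str               # the operational task to plan or execute
    views: list[str]               # multi-view images of the lab bench
    ghs_classes: list[str]         # GHS pictogram codes, e.g. GHS02 = flammable
    osha_reference: str            # relevant OSHA standard
    manipulation_complexity: str   # e.g. "single-step" vs. "multi-step"
    risk_profile: str              # e.g. "low" / "moderate" / "high"

# Example instance (contents invented for illustration).
task = LabTask(
    task_id="LS-0001",
    instruction="Transfer 50 mL of acetone between flasks near a hot plate.",
    views=["bench_front.jpg", "bench_top.jpg"],
    ghs_classes=["GHS02", "GHS07"],      # flammable; irritant
    osha_reference="29 CFR 1910.1450",   # OSHA Laboratory Standard
    manipulation_complexity="multi-step",
    risk_profile="high",
)
```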