SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

arXiv cs.AI / 4/22/2026

Key Points

  • The paper introduces SafetyALFRED, a safety-focused embodied-agent benchmark based on ALFRED and expanded with six categories of real-world kitchen hazards.
  • Unlike prior safety evaluations centered on hazard recognition in disembodied question-answering (QA) settings, SafetyALFRED measures both recognition and proactive risk mitigation via embodied planning.
  • Testing eleven state-of-the-art multimodal LLMs (Qwen, Gemma, Gemini families) shows a notable alignment gap: strong hazard recognition does not translate into effective mitigation.
  • The authors argue that static QA-based evaluations are insufficient for physical safety and call for a shift toward benchmarks that prioritize corrective actions in embodied contexts.
  • The code and dataset are released open-source for further research and evaluation (GitHub: https://github.com/sled-group/SafetyALFRED.git).

Abstract

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety; we therefore advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset at https://github.com/sled-group/SafetyALFRED.git.
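To make the "alignment gap" concrete, here is a minimal sketch of one way such a gap could be computed: the hazard recognition rate in QA minus the mitigation success rate in embodied planning. The abstract does not give a formula, so this definition is an assumption, as are the `EpisodeResult` record, its field names, and the `alignment_gap` helper. The records are made-up toy values, not results from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class EpisodeResult:
    """Hypothetical per-hazard record; not the SafetyALFRED data format."""
    model: str
    recognized: bool  # hazard correctly identified in the QA setting
    mitigated: bool   # hazard corrected during embodied planning


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def alignment_gap(results: List[EpisodeResult]) -> Tuple[float, float, float]:
    """Return (recognition rate, mitigation rate, recognition - mitigation)."""
    recognition = rate(r.recognized for r in results)
    mitigation = rate(r.mitigated for r in results)
    return recognition, mitigation, recognition - mitigation


# Toy records for illustration only; these are not values from the paper.
toy_results = [
    EpisodeResult("model-a", recognized=True, mitigated=False),
    EpisodeResult("model-a", recognized=True, mitigated=True),
    EpisodeResult("model-a", recognized=True, mitigated=False),
    EpisodeResult("model-a", recognized=False, mitigated=False),
]

recognition, mitigation, gap = alignment_gap(toy_results)
print(f"recognition={recognition:.2f} mitigation={mitigation:.2f} gap={gap:.2f}")
```

A positive gap under this definition means a model names hazards more often than it acts on them, which is the pattern the abstract describes. A real analysis would likely break the rates down per model and per hazard category.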