BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments

arXiv cs.AI / 3/30/2026


Key Points

  • The paper introduces BeSafe-Bench (BSB), a new benchmark designed to uncover behavioral safety risks of situated agents operating in functional (high-fidelity) environments rather than low-fidelity simulations or narrow tasks.
  • BSB covers four domains—Web, Mobile, Embodied VLM, and Embodied VLA—and expands instruction sets by adding nine categories of safety-critical risks to tasks.
  • It uses a hybrid evaluation approach that combines rule-based checks with LLM-as-a-judge reasoning to assess how agents impact real environment outcomes.
  • Testing 13 popular agents reveals a concerning pattern: even the best agents complete fewer than 40% of tasks while fully satisfying safety constraints, and high task success frequently coincides with severe safety violations.

Abstract

The rapid evolution of Large Multimodal Models (LMMs) has enabled agents to perform complex digital and physical tasks, yet their deployment as autonomous decision-makers introduces substantial unintentional behavioral safety risks. However, the absence of a comprehensive safety benchmark remains a major bottleneck, as existing evaluations rely on low-fidelity environments, simulated APIs, or narrowly scoped tasks. To address this gap, we present BeSafe-Bench (BSB), a benchmark for exposing behavioral safety risks of situated agents in functional environments, covering four representative domains: Web, Mobile, Embodied VLM, and Embodied VLA. Using functional environments, we construct a diverse instruction space by augmenting tasks with nine categories of safety-critical risks, and adopt a hybrid evaluation framework that combines rule-based checks with LLM-as-a-judge reasoning to assess real environmental impacts. Evaluating 13 popular agents reveals a concerning trend: even the best-performing agent completes fewer than 40% of tasks while fully adhering to safety constraints, and strong task performance frequently coincides with severe safety violations. These findings underscore the urgent need for improved safety alignment before deploying agentic systems in real-world settings.
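The hybrid evaluation framework described above, combining deterministic rule-based checks with LLM-as-a-judge reasoning, can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the `Episode`, `Rule`, and `hybrid_evaluate` names, the example safety rules, and the stubbed judge are all assumptions invented here to show the general pattern (rules are authoritative when they fire; ambiguous episodes are deferred to a judge model).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical episode record: the instruction, the agent's action
# trace, and the environment state observed after the episode ends.
@dataclass
class Episode:
    instruction: str
    actions: List[str]
    final_state: dict

# A rule is a named deterministic predicate over an episode.
@dataclass
class Rule:
    name: str
    check: Callable[[Episode], bool]  # True => rule satisfied (safe)

def hybrid_evaluate(episode: Episode, rules: List[Rule],
                    llm_judge: Callable[[Episode], bool]) -> dict:
    """Run rule-based checks first; defer ambiguous cases to an LLM judge."""
    violations = [r.name for r in rules if not r.check(episode)]
    if violations:
        # Deterministic rules are authoritative when they fire.
        return {"safe": False, "violations": violations, "source": "rules"}
    # No rule fired: ask the judge to reason over the full trace.
    return {"safe": llm_judge(episode), "violations": [], "source": "llm_judge"}

# --- usage with illustrative rules and a stub judge (no API call) ---
rules = [
    Rule("no_irreversible_delete",
         lambda e: not any("delete" in a for a in e.actions)),
    Rule("confirmation_before_payment",
         lambda e: "confirm" in e.actions or "pay" not in e.actions),
]

def stub_judge(e: Episode) -> bool:
    # Placeholder: a real judge would prompt an LLM with the trace.
    return e.final_state.get("harm", False) is False

ep = Episode("clear old files", ["open_folder", "delete_all"], {"harm": True})
print(hybrid_evaluate(ep, rules, stub_judge))
```

The key design point this sketch illustrates is ordering: cheap, high-precision rules catch clear-cut violations (and assess concrete environment-state impact), while the more expensive judge model handles only the episodes the rules cannot decide.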