BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments

arXiv cs.AI / 3/30/2026


Key Points

  • The paper introduces BeSafe-Bench (BSB), a new benchmark designed to uncover behavioral safety risks of situated agents operating in functional (high-fidelity) environments rather than low-fidelity simulations or narrow tasks.
  • BSB covers four domains—Web, Mobile, Embodied VLM, and Embodied VLA—and expands instruction sets by adding nine categories of safety-critical risks to tasks.
  • It uses a hybrid evaluation approach that combines rule-based checks with LLM-as-a-judge reasoning to assess how agents impact real environment outcomes.
  • Testing 13 popular agents reveals a concerning pattern: even the best agents complete fewer than 40% of tasks while fully satisfying safety constraints, and high task success frequently coincides with severe safety violations.

Abstract

The rapid evolution of Large Multimodal Models (LMMs) has enabled agents to perform complex digital and physical tasks, yet their deployment as autonomous decision-makers introduces substantial unintentional behavioral safety risks. However, the absence of a comprehensive safety benchmark remains a major bottleneck, as existing evaluations rely on low-fidelity environments, simulated APIs, or narrowly scoped tasks. To address this gap, we present BeSafe-Bench (BSB), a benchmark for exposing behavioral safety risks of situated agents in functional environments, covering four representative domains: Web, Mobile, Embodied VLM, and Embodied VLA. Using functional environments, we construct a diverse instruction space by augmenting tasks with nine categories of safety-critical risks, and adopt a hybrid evaluation framework that combines rule-based checks with LLM-as-a-judge reasoning to assess real environmental impacts. Evaluating 13 popular agents reveals a concerning trend: even the best-performing agent completes fewer than 40% of tasks while fully adhering to safety constraints, and strong task performance frequently coincides with severe safety violations. These findings underscore the urgent need for improved safety alignment before deploying agentic systems in real-world settings.
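The hybrid evaluation framework described above, combining deterministic rule-based checks with LLM-as-a-judge reasoning, can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the `Episode`, `Rule`, and `hybrid_evaluate` names, the example safety rules, and the stubbed judge are all assumptions invented here to show the general pattern (rules are authoritative when they fire; ambiguous episodes are deferred to a judge model).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical episode record: the instruction, the agent's action
# trace, and the environment state observed after the episode ends.
@dataclass
class Episode:
    instruction: str
    actions: List[str]
    final_state: dict

# A rule is a named deterministic predicate over an episode.
@dataclass
class Rule:
    name: str
    check: Callable[[Episode], bool]  # True => rule satisfied (safe)

def hybrid_evaluate(episode: Episode, rules: List[Rule],
                    llm_judge: Callable[[Episode], bool]) -> dict:
    """Run rule-based checks first; defer ambiguous cases to an LLM judge."""
    violations = [r.name for r in rules if not r.check(episode)]
    if violations:
        # Deterministic rules are authoritative when they fire.
        return {"safe": False, "violations": violations, "source": "rules"}
    # No rule fired: ask the judge to reason over the full trace.
    return {"safe": llm_judge(episode), "violations": [], "source": "llm_judge"}

# --- usage with illustrative rules and a stub judge (no API call) ---
rules = [
    Rule("no_irreversible_delete",
         lambda e: not any("delete" in a for a in e.actions)),
    Rule("confirmation_before_payment",
         lambda e: "confirm" in e.actions or "pay" not in e.actions),
]

def stub_judge(e: Episode) -> bool:
    # Placeholder: a real judge would prompt an LLM with the trace.
    return e.final_state.get("harm", False) is False

ep = Episode("clear old files", ["open_folder", "delete_all"], {"harm": True})
print(hybrid_evaluate(ep, rules, stub_judge))
```

The key design point this sketch illustrates is ordering: cheap, high-precision rules catch clear-cut violations (and assess concrete environment-state impact), while the more expensive judge model handles only the episodes the rules cannot decide.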