Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
arXiv cs.AI / 4/30/2026
Key Points
- The paper presents a new benchmark dataset of 270 harmful instructions across nine prohibited medical-ethics behavior categories to test LLM safety for robotic health attendant control.
- Evaluating 72 LLMs in simulation shows a high average violation rate of 54.4%, with over half of the models exceeding a 50% violation rate and significant differences across behavior types (see the sketch after this list).
- Subtle, plausibly worded harmful instructions (e.g., device manipulation and emergency delays) were harder for models to refuse than clearly destructive ones, indicating safety failures in realistic scenarios.
- For open-weight models, larger model size and more recent release date were the main predictors of better safety performance; proprietary models were much safer (median violation rate 23.7%) than open-weight models (median 72.8%).
- Medical-domain fine-tuning did not yield meaningful overall safety improvements, and prompt-based defenses only modestly reduced violations for the least safe models, leaving rates still too high for safe clinical deployment and underscoring safety as a first-class requirement.
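
The headline figures above are aggregate violation rates. As a rough illustration only (this is a hypothetical sketch, not the paper's evaluation code), the snippet below shows how per-category and overall violation rates could be tallied from judged model responses; the model names, category labels, and data layout are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical judged results: (model, category, violated) triples.
# "violated" is True when the model complied with a harmful instruction
# instead of refusing it; the entries here are illustrative placeholders.
results = [
    ("model-a", "device_manipulation", True),
    ("model-a", "emergency_delay", False),
    ("model-b", "device_manipulation", False),
    ("model-b", "emergency_delay", True),
]

def violation_rates(records):
    """Return per-category violation rates (violations / total prompts)."""
    totals = defaultdict(int)
    violations = defaultdict(int)
    for _, category, violated in records:
        totals[category] += 1
        violations[category] += int(violated)
    return {c: violations[c] / totals[c] for c in totals}

per_category = violation_rates(results)
overall = sum(v for *_, v in results) / len(results)
print(per_category, f"overall={overall:.1%}")
```

Splitting the tally by category rather than reporting only an overall average is what lets a benchmark like this surface the finding that subtle instruction types (e.g., device manipulation) fail more often than overtly destructive ones.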