Using large language models for embodied planning introduces systematic safety risks

arXiv cs.RO / 4/21/2026


Key Points

  • The paper introduces DESPITE, a deterministic benchmark with 12,279 robotic planning tasks covering both physical and normative dangers to systematically evaluate safety.
  • Results across 23 large language models show a critical mismatch: even a near-perfect planner still generates dangerous plans on 28.3% of tasks, despite failing to produce a valid plan on only 0.4%.
  • Among 18 open-source models (3B–671B parameters), planning success scales strongly with model size (0.4–99.3%), while safety awareness stays relatively flat (38–57%).
  • The authors find a multiplicative relationship between planning ability and safety, suggesting that larger models mainly improve safe task completion through better planning rather than superior danger avoidance.
  • Proprietary reasoning models achieve higher safety awareness (71–81%), but as planning capability saturates for frontier models, improving safety awareness becomes the key remaining challenge for deploying LLM-based planners in robotics.
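
The multiplicative relationship described above can be sketched numerically. The snippet below is an illustrative model, not the paper's code: it assumes expected safe completion is roughly the product of planning success and safety awareness, using the endpoints of the reported ranges (0.4–99.3% planning, 38–57% safety awareness) rather than any specific model's results.

```python
def safe_completion(planning_success: float, safety_awareness: float) -> float:
    """Expected fraction of tasks completed safely, assuming the two
    capacities combine multiplicatively (an illustrative model)."""
    return planning_success * safety_awareness

# Endpoints of the reported open-source ranges (illustrative, not per-model):
small = safe_completion(0.004, 0.38)   # weakest planner, low safety awareness
large = safe_completion(0.993, 0.57)   # near-perfect planner, flat safety awareness

# Scaling lifts safe completion almost entirely through the planning term;
# the safety-awareness factor barely moves across the size range.
print(f"small: {small:.3f}, large: {large:.3f}")
```

Under this toy model, nearly all of the gain from scale comes from the planning factor, which is the paper's argument: once planning saturates near 1.0, only raising safety awareness can further improve safe completion.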

Abstract

Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.