Using large language models for embodied planning introduces systematic safety risks
arXiv cs.RO / 4/21/2026
Key Points
- The paper introduces DESPITE, a deterministic benchmark with 12,279 robotic planning tasks covering both physical and normative dangers to systematically evaluate safety.
- Results across 23 large language models reveal a critical mismatch: even a near-perfect planner generates dangerous plans on 28.3% of tasks, despite failing to produce a valid plan on only 0.4%.
- Among 18 open-source models (3B–671B parameters), planning success scales strongly with model size (0.4–99.3%), while safety awareness stays relatively flat (38–57%).
- The authors find a multiplicative relationship between planning ability and safety, suggesting that larger models mainly improve safe task completion through better planning rather than superior danger avoidance.
- Proprietary reasoning models achieve higher safety awareness (71–81%), but as planning capability saturates for frontier models, improving safety awareness becomes the key remaining challenge for deploying LLM-based planners in robotics.
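The multiplicative relationship described above can be sketched numerically. This is an illustrative assumption, not the paper's model: it treats the fraction of tasks completed both successfully and safely as the product of planning success rate and safety awareness rate, using the article's reported ranges as example inputs.

```python
# Hedged sketch of the multiplicative relationship between planning ability
# and safety, assuming: safe completion ~= planning success * safety awareness.
# The function and the example numbers are illustrative, not from the paper itself.

def safe_completion(planning_success: float, safety_awareness: float) -> float:
    """Estimated fraction of tasks completed both successfully and safely."""
    return planning_success * safety_awareness

# A large open-source model: near-saturated planning, flat safety awareness.
large = safe_completion(0.993, 0.57)

# A small open-source model: weak planning dominates the product,
# so overall safe completion is tiny regardless of safety awareness.
small = safe_completion(0.004, 0.38)

print(f"large: {large:.3f}, small: {small:.4f}")
```

Under this reading, scaling up a model mostly raises the first factor; the second (safety awareness) stays flat for open-source models, which is why the authors argue it becomes the binding constraint once planning saturates.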