ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
arXiv cs.AI / 4/8/2026
Key Points
- ClawsBench is introduced as a safer, more realistic benchmark for evaluating LLM productivity agents: it runs them in a simulated workspace with explicit state management and deterministic snapshot/restore, so no irreversible changes land on real services (a minimal snapshot/restore sketch follows this list).
- The benchmark models five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) and includes 44 structured tasks spanning single-service, cross-service, and safety-critical scenarios (an illustrative task shape appears after this list).
- The authors vary two independent scaffolding levers: domain skills that inject API knowledge via progressive disclosure, and a coordinating meta-prompt. They measure each lever's individual and combined impact on agent performance and behavior (a progressive-disclosure sketch follows this list).
- Across experiments covering 6 models, 4 agent harnesses, and 33 conditions, agents show moderate task success (39–64%) but non-trivial unsafe action rates (7–33%), with task success and safety not consistently correlated.
- Eight recurring unsafe behavior patterns are identified (e.g., multi-step sandbox escalation and silent contract modification); top results on OpenClaw show task success of 53–63% while unsafe action rates range from 7–23%.
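
The paper's actual snapshot/restore implementation is not reproduced in this summary. As a hedged illustration only, the Python sketch below shows one common way a mock service's state can be checkpointed deterministically so every task run starts clean; all names (`MockService`, `snapshot`, `restore`) are hypothetical, not ClawsBench's API.

```python
import copy

class MockService:
    """Hypothetical in-memory mock of a workspace service (e.g., a mail inbox).

    All state lives in `self.state`, so a deep copy is a complete,
    deterministic snapshot; restoring it undoes any agent actions.
    """

    def __init__(self, initial_state: dict):
        self.state = copy.deepcopy(initial_state)
        self._snapshots: dict[str, dict] = {}

    def snapshot(self, tag: str) -> None:
        # Deep-copy so later mutations cannot leak into the checkpoint.
        self._snapshots[tag] = copy.deepcopy(self.state)

    def restore(self, tag: str) -> None:
        # Roll the service back to the checkpoint, discarding agent edits.
        self.state = copy.deepcopy(self._snapshots[tag])

# Usage: checkpoint before a task, let the agent act, then roll back.
inbox = MockService({"emails": [{"id": 1, "subject": "Q3 report"}]})
inbox.snapshot("task_start")
inbox.state["emails"].clear()   # an unsafe agent deletes everything
inbox.restore("task_start")     # the harness undoes the damage
assert inbox.state["emails"][0]["subject"] == "Q3 report"
```

Because the mock holds all state in memory, restore is exact and cheap, which is what makes repeated, safe evaluation of destructive agent actions possible.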
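The paper's task schema is likewise not published here; a plausible shape for one of the 44 structured tasks might look like the following, with every field name being an illustrative assumption rather than the benchmark's actual format.

```python
# Hypothetical task specification; field names are illustrative only.
task = {
    "id": "cross_service_07",
    "category": "cross-service",  # single-service | cross-service | safety-critical
    "services": ["gmail", "gcal"],
    "instruction": "Find the latest email from HR and add its deadline to my calendar.",
    "initial_state": "fixtures/cross_service_07.json",
    "success_check": "calendar contains an event matching the email's deadline",
    "unsafe_actions": ["sending email on the user's behalf without confirmation"],
}
```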
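Progressive disclosure, in the general sense used for agent scaffolding, means the agent first sees a terse index of available skills and expands full API documentation only on demand, keeping the context window small. A minimal sketch of that pattern, assuming a two-level skill store with hypothetical names:

```python
# Hypothetical two-level skill store: short summaries up front, full API
# docs disclosed only when the agent requests a specific skill.
SKILLS = {
    "gmail.search": {
        "summary": "Search the mailbox by query string.",
        "details": "search(query: str, max_results: int = 10) -> list[Email]. "
                   "Supports Gmail-style operators like from: and subject:.",
    },
    "gcal.create_event": {
        "summary": "Create a calendar event.",
        "details": "create_event(title: str, start: str, end: str) -> EventId. "
                   "Times are ISO 8601; overlapping events require confirmation.",
    },
}

def skill_index() -> str:
    """Level 1: one line per skill, injected into the system prompt."""
    return "\n".join(f"{name}: {s['summary']}" for name, s in SKILLS.items())

def disclose(name: str) -> str:
    """Level 2: full documentation, returned only on explicit request."""
    return SKILLS[name]["details"]

print(skill_index())             # what the agent sees by default
print(disclose("gmail.search"))  # expanded on demand
```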