Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
arXiv cs.AI / 4/8/2026
Key Points
- Claw-Eval is introduced as an end-to-end evaluation suite for autonomous agents, targeting gaps in agent benchmarks around trajectory visibility, safety/robustness coverage, and modality breadth.
- The suite covers 300 human-verified tasks across nine categories and three interaction settings (service orchestration, multimodal perception/generation, and multi-turn professional dialogue), with 2,159 fine-grained rubric items.
- It records every agent action through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading rather than checking only final outputs; a hypothetical record schema is sketched after this list.
- Scoring evaluates Completion, Safety, and Robustness, reported as Average Score, Pass@k, and Pass^k over three trials per task to reduce the chance of "lucky" pass outcomes; see the metric sketch after this list.
- Experiments on 14 frontier models show that trajectory-opaque evaluation misses 44% of safety violations and 13% of robustness failures; multimodal performance varies widely (often worst on video); and error injection mainly degrades consistency rather than peak capability.