Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

arXiv cs.AI / 4/8/2026


Key Points

  • Claw-Eval is introduced as an end-to-end evaluation suite for autonomous agents, targeting gaps in agent benchmarks around trajectory visibility, safety/robustness coverage, and modality breadth.
  • The suite covers 300 human-verified tasks across nine categories and three interaction settings (service orchestration, multimodal perception/generation, and multi-turn professional dialogue), with 2,159 fine-grained rubric items.
  • It records every agent action via three independent evidence channels—execution traces, audit logs, and environment snapshots—to enable trajectory-aware grading rather than only checking final outputs.
  • Scoring evaluates Completion, Safety, and Robustness using metrics like Average Score, Pass@k, and Pass^k across three trials to reduce the chance of “lucky” pass outcomes.
  • Experiments on 14 frontier models show that existing trajectory-opaque evaluation misses 44% of safety violations and 13% of robustness failures, multimodal performance varies significantly (often worse on video), and error injection mainly harms consistency rather than top-end capability.
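The Pass@k versus Pass^k distinction above can be sketched concretely: Pass@k asks whether at least one of k trials succeeds (peak capability), while Pass^k asks whether all k trials succeed (consistency). The sketch below is a minimal illustration of that aggregation, assuming each task is run for three trials producing a boolean pass/fail and a rubric score in [0, 1]; the function and field names are illustrative assumptions, not the suite's actual API.

```python
def aggregate(trials: list[dict]) -> dict:
    """Aggregate k trial results for a single task (illustrative sketch)."""
    passes = [t["passed"] for t in trials]
    scores = [t["score"] for t in trials]
    return {
        "avg_score": sum(scores) / len(scores),  # Average Score across trials
        "pass_at_k": any(passes),                # Pass@k: >= 1 trial succeeds
        "pass_hat_k": all(passes),               # Pass^k: every trial succeeds
    }

# A model that passes 2 of 3 trials earns Pass@3 but not Pass^3, which is
# how the protocol separates lucky one-off successes from reliable behavior.
trials = [
    {"passed": True,  "score": 0.9},
    {"passed": True,  "score": 0.8},
    {"passed": False, "score": 0.4},
]
result = aggregate(trials)
```

This also explains the error-injection finding: perturbations that rarely break a model's best run leave Pass@3 stable while still dragging down Pass^3.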

Abstract

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on document or image inputs, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
