EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

arXiv cs.AI / 4/2/2026


Key Points

  • The paper argues that current benchmarks for LLM agents focus mainly on whether downstream tasks succeed, overlooking the quality risks of the tool libraries those agents generate at runtime.
  • It introduces EvolveTool-Bench, a benchmark that evaluates LLM-generated tool libraries using library-level metrics (e.g., reuse, redundancy, composition success, regression stability, and safety) and per-tool Tool Quality Scores (e.g., correctness, robustness, generality, and code quality).
  • Across three execution-dependent domains—proprietary data formats, API orchestration, and numerical computation—the authors show that tool libraries can vary substantially in health even when task completion rates are similar.
  • In a head-to-head comparison (ARISE vs. EvoSkill vs. one-shot baselines) over 99 tasks with two models, systems with comparable task completion (63–68%) can differ by up to 18% in library health, highlighting the limitations of task-only evaluation.
  • The work concludes that evaluating and governing evolving, LLM-generated tools should treat the tool library as a first-class software artifact rather than a black box.

Abstract

Modern LLM agents increasingly create their own tools at runtime -- from Python functions to API clients -- yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool-Bench, a diagnostic benchmark for LLM-generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library-level software quality metrics -- reuse, redundancy, composition success, regression stability, and safety -- alongside a per-tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head-to-head comparison of code-level and strategy-level tool evolution (ARISE vs. EvoSkill vs. one-shot baselines, 99 tasks, two models), we show that systems with similar task completion (63-68%) differ by up to 18% in library health, revealing software quality risks invisible to task-only evaluation. Our results highlight that evaluation and governance of LLM-generated tools require treating the evolving tool library as a first-class software artifact, not a black box.
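The abstract names two layers of measurement: library-level metrics (reuse, redundancy, composition success, regression stability, safety) and a per-tool Tool Quality Score over correctness, robustness, generality, and code quality. The paper does not spell out its aggregation formulas here, so the following is only an illustrative sketch of how such scoring might be structured; the field names, the equal-ish weights, and the `library_health` summary are assumptions, not EvolveTool-Bench's actual definitions.

```python
from dataclasses import dataclass

@dataclass
class ToolReport:
    """Per-tool measurements, each normalized to [0, 1] (illustrative)."""
    name: str
    correctness: float   # fraction of functional checks passed
    robustness: float    # fraction of edge/adversarial inputs handled
    generality: float    # fraction of held-out task variants supported
    code_quality: float  # static-analysis / style score

def tool_quality_score(r: ToolReport,
                       weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Hypothetical weighted aggregate of the four per-tool dimensions.

    Weights are an assumption; correctness is weighted highest here
    simply to reflect that a tool must work before anything else matters.
    """
    dims = (r.correctness, r.robustness, r.generality, r.code_quality)
    return sum(w * d for w, d in zip(weights, dims))

def library_health(reports: list[ToolReport],
                   calls_per_tool: dict[str, int]) -> dict:
    """Hypothetical library-level rollup: mean TQS plus a reuse rate.

    A tool counts as "reused" if it was invoked by more than one task;
    redundancy, composition success, and regression stability would need
    richer execution traces than this sketch models.
    """
    mean_tqs = sum(tool_quality_score(r) for r in reports) / len(reports)
    reused = sum(1 for n in calls_per_tool.values() if n > 1)
    return {
        "mean_tqs": mean_tqs,
        "reuse_rate": reused / len(calls_per_tool),
    }

# Example: two tools with identical task-completion behavior can still
# yield different library-health summaries.
reports = [
    ToolReport("parse_fmt", correctness=1.0, robustness=0.5,
               generality=0.5, code_quality=0.5),
    ToolReport("plot_series", correctness=0.5, robustness=0.5,
               generality=0.5, code_quality=0.5),
]
health = library_health(reports, {"parse_fmt": 3, "plot_series": 1})
```

The point of such a rollup mirrors the paper's argument: two agents can complete the same fraction of tasks while their libraries diverge on reuse and quality, which a task-only benchmark never surfaces.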