SkillTester: Benchmarking Utility and Security of Agent Skills

arXiv cs.AI / 4/1/2026



Key Points

The arXiv paper introduces SkillTester, a benchmarking tool designed to evaluate both the utility and the security of agent skills.
Its framework combines paired baseline and with-skill execution conditions with a separate security probe suite to measure performance and safety differences.
Results are normalized into a utility score, a security score, and a three-level security status label to make comparisons more consistent and interpretable.
The project is presented as a quality-assurance harness for agent-first systems, with a deployed public service (skilltester.ai) and an associated GitHub repository for ongoing maintenance.

Abstract

This technical report presents SkillTester, a tool for evaluating the utility and security of agent skills. Its evaluation framework combines paired baseline and with-skill execution conditions with a separate security probe suite. Grounded in a comparative utility principle and a user-facing simplicity principle, the framework normalizes raw execution artifacts into a utility score, a security score, and a three-level security status label. More broadly, it can be understood as a comparative quality-assurance harness for agent skills in an agent-first world. The public service is deployed at https://skilltester.ai, and the broader project is maintained at https://github.com/skilltester-ai/skilltester.
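
To make the three outputs concrete, here is a minimal Python sketch of how paired runs and probe results might be reduced to a utility score, a security score, and a three-level status label. The abstract does not give the actual formulas, so everything here is an assumption for illustration: the `PairedRun` schema, the pass-rate difference used for utility, the pass fraction used for security, the thresholds, and the label names "safe", "caution", and "unsafe" are all hypothetical and are not taken from SkillTester.

```python
from dataclasses import dataclass


@dataclass
class PairedRun:
    """Outcome of one task under both conditions (hypothetical schema)."""
    task_id: str
    baseline_passed: bool    # agent ran the task without the skill
    with_skill_passed: bool  # agent ran the same task with the skill


def utility_score(runs: list[PairedRun]) -> float:
    """Assumed metric: with-skill pass rate minus baseline pass rate, in [-1, 1]."""
    if not runs:
        raise ValueError("need at least one paired run")
    baseline = sum(r.baseline_passed for r in runs) / len(runs)
    with_skill = sum(r.with_skill_passed for r in runs) / len(runs)
    return with_skill - baseline


def security_score(probe_results: list[bool]) -> float:
    """Assumed metric: fraction of security probes passed, in [0, 1]."""
    if not probe_results:
        raise ValueError("need at least one probe result")
    return sum(probe_results) / len(probe_results)


def security_status(score: float, safe_at: float = 0.9, warn_at: float = 0.6) -> str:
    """Map a security score to one of three labels (placeholder thresholds and names)."""
    if score >= safe_at:
        return "safe"
    if score >= warn_at:
        return "caution"
    return "unsafe"


# Illustrative inputs only; these are not results from the report.
runs = [
    PairedRun("t1", baseline_passed=False, with_skill_passed=True),
    PairedRun("t2", baseline_passed=True, with_skill_passed=True),
    PairedRun("t3", baseline_passed=False, with_skill_passed=False),
    PairedRun("t4", baseline_passed=False, with_skill_passed=True),
]
probes = [True, True, True, False]

utility = utility_score(runs)
security = security_score(probes)
status = security_status(security)
print(utility, security, status)
```

The point of the paired design in this sketch is that utility is always relative: a skill that passes every task scores zero if the baseline agent already passed them all. Collapsing the security score into a small set of labels reflects the user-facing simplicity principle the abstract mentions, though the real cut-offs may differ.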