How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

arXiv cs.CL / 4/7/2026

Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper benchmarks how much “agentic skills” (reusable, domain-specific knowledge artifacts) improve LLM agent performance under increasingly realistic conditions, including scenarios where agents must retrieve skills from a collection of 34k real-world skills rather than being handed hand-crafted ones (a minimal retrieval sketch follows this list).
  • Results show that skill benefits are fragile: as realism increases and skill matching becomes less tailored, performance gains consistently degrade and can converge toward no-skill baselines in the hardest settings.
  • The study tests skill refinement strategies (query-specific vs. query-agnostic) and finds that query-specific refinement can substantially recover performance when the initially retrieved skills are reasonably relevant and high quality.
  • Using Terminal-Bench 2.0 as a demonstration, retrieval plus refinement increases Claude Opus 4.6's pass rate from 57.7% to 65.5%, suggesting the approach generalizes beyond a single benchmark.
  • Findings across multiple models indicate both promise and current limitations of skill-based augmentation, and the authors release code for reproducibility.
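The paper's retrieval setup is not reproduced here, but the following minimal sketch illustrates the general pattern of matching a task query against a large skill library by embedding similarity. The toy skill list, the encoder choice (all-MiniLM-L6-v2), and top_k are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of embedding-based skill retrieval. This is NOT the paper's
# exact retriever: the skills, encoder, and top_k below are stand-ins.
from sentence_transformers import SentenceTransformer, util

# A skill library: the paper uses ~34k real-world skill artifacts; a toy
# list of (name, description) pairs stands in for it here.
SKILLS = [
    ("git-bisect", "Locate the commit that introduced a regression with git bisect."),
    ("pdf-extract", "Extract text and tables from PDF files."),
    ("docker-debug", "Diagnose failing containers via logs, exec, and inspect."),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
# Precompute skill embeddings once; only the query is embedded per task.
skill_embs = model.encode([desc for _, desc in SKILLS], convert_to_tensor=True)

def retrieve_skills(task_query: str, top_k: int = 2):
    """Return the top_k skills whose descriptions best match the task query."""
    query_emb = model.encode(task_query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, skill_embs)[0]   # cosine similarity to each skill
    ranked = scores.argsort(descending=True)[:top_k]
    return [(SKILLS[int(i)][0], float(scores[i])) for i in ranked]

print(retrieve_skills("find which commit broke the test suite"))
```

At 34k skills, a real deployment would cache the precomputed skill embeddings (as above) and could swap in an approximate nearest-neighbor index; the ranking logic itself stays the same.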

Abstract

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill usage remain scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted skills narrowly tailored to each task, whereas in many realistic settings the LLM agent must search for and select relevant skills on its own, and even the closest-matching skills may not be well tailored to the task. In this paper, we conduct the first comprehensive study of skill utility under progressively more challenging realistic settings, where agents must retrieve skills from a collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
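To make the two refinement strategies concrete, here is a hedged sketch of how each might be prompted. The prompt wording and the call_llm helper (a stand-in for any chat-completion client) are assumptions for illustration, not the authors' implementation: query-specific refinement sees the task and rewrites the skill toward it, while query-agnostic refinement improves the skill in isolation.

```python
# Hedged sketch of the two refinement strategies the paper compares.
# `call_llm` is any function mapping a prompt string to a completion string,
# e.g. a thin wrapper around an API client; the prompts are illustrative.

def refine_query_specific(skill_text: str, task_query: str, call_llm) -> str:
    """Rewrite a retrieved skill so it directly serves the task at hand."""
    prompt = (
        "You are refining a reusable agent skill for a specific task.\n"
        f"Task: {task_query}\n\nSkill:\n{skill_text}\n\n"
        "Rewrite the skill so its instructions, commands, and examples "
        "directly address this task. Return only the revised skill."
    )
    return call_llm(prompt)

def refine_query_agnostic(skill_text: str, call_llm) -> str:
    """Improve a skill's general clarity and quality without seeing any task."""
    prompt = (
        "Improve this reusable agent skill: fix errors, tighten wording, and "
        "make the steps unambiguous, without assuming any particular task.\n\n"
        f"Skill:\n{skill_text}\n\nReturn only the revised skill."
    )
    return call_llm(prompt)
```

Under this framing, the paper's finding is intuitive: query-specific refinement can repair the mismatch between a generic retrieved skill and the concrete task, which query-agnostic polishing, by construction, never sees.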