Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

arXiv cs.CL / 4/16/2026


Key Points

  • The paper studies ClawHub, a large public registry of LLM agent “skills,” by building and normalizing a dataset of 26,502 skills and analyzing language, organization, popularity, and security-related signals.
  • It finds strong cross-lingual patterns: English skills skew toward infrastructure and technical capabilities (e.g., APIs, automation, memory), while Chinese skills cluster more around application scenarios such as media generation, social content, and finance services.
  • The authors report that over 30% of crawled skills are flagged as suspicious or malicious by available platform signals, and that many skills still lack complete safety observability.
  • They propose an early risk-assessment approach that uses only submission-time information, evaluating 12 classifiers on a balanced benchmark of 11,010 skills; the best model, Logistic Regression, reaches 72.62% accuracy and 78.95% AUROC.
  • Documentation quality is identified as the most informative submission-time signal for predicting skill risk, highlighting public registries as both an enabler for reuse and a new security risk surface.

Abstract

Skill ecosystems have emerged as an increasingly important layer in Large Language Model (LLM) agent systems, enabling reusable task packaging, public distribution, and community-driven capability sharing. However, despite their rapid growth, the functionality, ecosystem structure, and security risks of public skill registries remain underexplored. In this paper, we present an empirical study of ClawHub, a large public registry of agent skills. We build and normalize a dataset of 26,502 skills, and conduct a systematic analysis of their language distribution, functional organization, popularity, and security signals. Our clustering results show clear cross-lingual differences: English skills are more infrastructure-oriented and centered on technical capabilities such as APIs, automation, and memory, whereas Chinese skills are more application-oriented, with clearer scenario-driven clusters such as media generation, social content production, and finance-related services. We further find that more than 30% of all crawled skills are labeled as suspicious or malicious by available platform signals, while a substantial fraction of skills still lack complete safety observability. To study early risk assessment, we formulate submission-time skill risk prediction using only information available at publication time, and construct a balanced benchmark of 11,010 skills. Across 12 classifiers, the best-performing model, Logistic Regression, achieves an accuracy of 72.62% and an AUROC of 78.95%, with primary documentation emerging as the most informative submission-time signal. Our findings position public skill registries as both a key enabler of agent capability reuse and a new surface for ecosystem-scale security risk.
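To make the submission-time prediction task concrete, here is a minimal sketch of the kind of classifier the paper describes: a logistic regression trained only on features observable at publication time. The paper's actual feature set and pipeline are not specified in this summary, so the features below (documentation length, presence of a usage section, external-link density) are purely illustrative stand-ins, and the training data is synthetic.

```python
import math

def extract_features(skill):
    """Hypothetical submission-time features from a skill's documentation.

    These are illustrative only; the paper reports that primary
    documentation is the most informative signal, but does not (here)
    enumerate exact features.
    """
    doc = skill.get("readme", "")
    return [
        min(len(doc) / 1000.0, 1.0),        # normalized documentation length
        1.0 if "## Usage" in doc else 0.0,  # has a usage section
        min(doc.count("http"), 5) / 5.0,    # external-link density
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=300):
    """Plain batch gradient descent for binary logistic regression."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the logit
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gwj / len(X) for wj, gwj in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict(w, b, x):
    """Probability that a skill is risky (label 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic toy data: well-documented skills labeled benign (0),
# sparsely documented skills labeled risky (1).
skills = [
    {"readme": "A" * 1200 + "\n## Usage\nSee http://docs.example.com"},
    {"readme": "B" * 900 + "\n## Usage\nhttp://a http://b"},
    {"readme": "run me"},
    {"readme": "do stuff http://x"},
]
labels = [0, 0, 1, 1]
X = [extract_features(s) for s in skills]
w, b = train_logreg(X, labels)
```

On this toy data the model learns to assign a higher risk score to the undocumented skill than to the well-documented one, mirroring the paper's finding that documentation quality is the strongest submission-time signal.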