STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems

arXiv cs.AI / 4/14/2026

Key Points

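This work addresses a gap that appears when autonomous LLM agents invoke skills or tools: static auditing alone cannot tell whether a particular invocation is unsafe given the user's request and the runtime context.
STARS combines a static capability prior with an invocation-risk model conditioned on the user request and runtime context, and merges the two through a calibrated risk-fusion policy, so that invocations can be ranked and triaged before any hard intervention is applied.
For evaluation, the authors build SIA-Bench, a benchmark of 3,000 skill-invocation records that provides group-safe splits, lineage metadata, runtime context, canonical action labels, and continuous risk targets.
In offline evaluation on a held-out split of indirect prompt injection attacks, calibrated fusion reaches 0.439 high-risk AUPRC, improving over the contextual scorer (0.405) and the strongest static baseline (0.380). The contextual scorer, however, remains better calibrated, with an expected calibration error of 0.289.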
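The summary does not say how STARS fuses its two signals, so the following is only a minimal sketch of one common way to do calibrated fusion, assuming a stacking-style logistic (Platt-like) calibrator over the logits of a static prior score and a contextual score, both in [0, 1]. The class `RiskFusion`, the function `triage`, the 0.5 review threshold and the toy data are all hypothetical and are not taken from the STARS code; consult the linked repository for the actual method.

```python
import math
from typing import List, Sequence, Tuple


def _logit(p: float, eps: float = 1e-6) -> float:
    # Clamp so scores of exactly 0 or 1 do not produce infinite logits.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class RiskFusion:
    """Hypothetical calibrated fusion of a static prior and a contextual score.

    Models P(high risk) = sigmoid(w0 + w1 * logit(static) + w2 * logit(contextual))
    and fits the weights by batch gradient descent on log loss.
    """

    def __init__(self, lr: float = 0.1, epochs: int = 2000) -> None:
        self.lr = lr
        self.epochs = epochs
        self.w = [0.0, 0.0, 0.0]  # bias, static weight, contextual weight

    @staticmethod
    def _features(static: float, contextual: float) -> List[float]:
        return [1.0, _logit(static), _logit(contextual)]

    def fit(self, static: Sequence[float], contextual: Sequence[float],
            labels: Sequence[int]) -> "RiskFusion":
        xs = [self._features(s, c) for s, c in zip(static, contextual)]
        n = len(xs)
        for _ in range(self.epochs):
            grad = [0.0, 0.0, 0.0]
            for x, y in zip(xs, labels):
                # Gradient of log loss w.r.t. the linear score is (p - y).
                err = _sigmoid(sum(w * f for w, f in zip(self.w, x))) - y
                for j in range(3):
                    grad[j] += err * x[j]
            self.w = [w - self.lr * g / n for w, g in zip(self.w, grad)]
        return self

    def score(self, static: float, contextual: float) -> float:
        x = self._features(static, contextual)
        return _sigmoid(sum(w * f for w, f in zip(self.w, x)))


def triage(records: Sequence[Tuple[str, float, float]], fusion: RiskFusion,
           threshold: float = 0.5) -> List[Tuple[str, float, bool]]:
    """Rank invocations by fused risk, highest first, and flag those >= threshold."""
    scored = [(rid, fusion.score(s, c)) for rid, s, c in records]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [(rid, risk, risk >= threshold) for rid, risk in scored]


if __name__ == "__main__":
    # Toy calibration data: (static prior, contextual score, high-risk label).
    static = [0.2, 0.8, 0.3, 0.7, 0.6, 0.4]
    contextual = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3]
    labels = [0, 1, 1, 0, 1, 0]
    fusion = RiskFusion().fit(static, contextual, labels)
    for rid, risk, flagged in triage([("read_file", 0.5, 0.1),
                                      ("send_email", 0.5, 0.9)], fusion):
        print(f"{rid}: risk={risk:.3f} review={flagged}")
```

Fitting the fusion weights on labelled data, rather than averaging the two scores, is what makes the output interpretable as a probability: the learned weights decide how much the static prior should count once request-conditioned evidence is available, which mirrors the paper's finding that static priors stay useful but are not sufficient on their own.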

Abstract

Autonomous language-model agents increasingly rely on installable skills and tools to complete user tasks. Static skill auditing can expose capability surface before deployment, but it cannot determine whether a particular invocation is unsafe under the current user request and runtime context. We therefore study skill invocation auditing as a continuous-risk estimation problem: given a user request, candidate skill, and runtime context, predict a score that supports ranking and triage before a hard intervention is applied. We introduce STARS, which combines a static capability prior, a request-conditioned invocation risk model, and a calibrated risk-fusion policy. To evaluate this setting, we construct SIA-Bench, a benchmark of 3,000 invocation records with group-safe splits, lineage metadata, runtime context, canonical action labels, and derived continuous-risk targets. On a held-out split of indirect prompt injection attacks, calibrated fusion reaches 0.439 high-risk AUPRC, improving over 0.405 for the contextual scorer and 0.380 for the strongest static baseline, while the contextual scorer remains better calibrated with 0.289 expected calibration error. On the locked in-distribution test split, gains are smaller and static priors remain useful. The resulting claim is therefore narrower: request-conditioned auditing is most valuable as an invocation-time risk-scoring and triage layer rather than as a replacement for static screening. Code is available at https://github.com/123zgj123/STARS.
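The abstract reports results as high-risk AUPRC and expected calibration error (ECE). As a rough illustration of what those two numbers measure, here is a sketch of how they are commonly computed; it is not the paper's evaluation code. It assumes a binary high-risk label per invocation (how SIA-Bench derives that label from its continuous-risk targets is not described here), 10 equal-width bins, and the binary variant of ECE that compares the mean predicted P(high risk) in each bin with the observed high-risk rate. The function names are hypothetical; without tied scores, the AUPRC estimate matches the step-wise average precision that scikit-learn's `average_precision_score` computes.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Estimate AUPRC as step-wise average precision.

    labels: 1 marks a high-risk invocation, 0 everything else. Tied scores are
    broken by input order, a simplification that full implementations avoid.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("AUPRC is undefined without positive examples")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            total += tp / rank  # precision at the recall step this positive adds
    return total / n_pos


def expected_calibration_error(probs: Sequence[float], labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Binary ECE: equal-width bins over predicted P(high risk)."""
    n = len(probs)
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, p in enumerate(probs):
        bins[min(int(p * n_bins), n_bins - 1)].append(i)  # p == 1.0 -> last bin
    ece = 0.0
    for idx in bins:
        if not idx:
            continue
        mean_prob = sum(probs[i] for i in idx) / len(idx)
        pos_rate = sum(labels[i] for i in idx) / len(idx)
        # Weight each bin's |confidence - observed frequency| gap by its size.
        ece += len(idx) / n * abs(mean_prob - pos_rate)
    return ece


if __name__ == "__main__":
    risk = [0.9, 0.8, 0.7, 0.6]
    high_risk = [1, 0, 1, 0]
    print(f"AUPRC={average_precision(risk, high_risk):.3f}")
    print(f"ECE={expected_calibration_error(risk, high_risk):.3f}")
```

The two metrics can disagree, which is exactly the pattern the abstract reports: AUPRC only rewards putting high-risk invocations near the top of the ranking, while ECE penalizes scores whose magnitudes do not match observed frequencies. A scorer can therefore rank better (fusion, 0.439 AUPRC) while another stays better calibrated (contextual scorer, 0.289 ECE).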