SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents

arXiv cs.AI / 4/15/2026



Key Points

The paper introduces SIR-Bench, a benchmark with 794 test cases designed to evaluate autonomous security incident response agents on both triage correctness and investigation depth rather than simple alert repetition.
SIR-Bench is built from 129 anonymized incident patterns and uses expert-validated ground truth to distinguish genuine forensic investigation from “alert parroting.”
To generate realistic, measurable evaluation scenarios, the authors develop Once Upon A Threat (OUAT), which replays incident patterns in controlled cloud environments to produce authentic telemetry.
The evaluation uses three complementary metrics—triage accuracy (M1), novel evidence discovery (M2), and tool usage appropriateness (M3)—scored with an adversarial LLM-as-Judge that requires concrete forensic evidence to credit investigations.
In the reported results, the authors' SIR agent achieved 97.1% true positive detection and 73.4% false positive rejection, with an average of 5.67 novel key findings per case, establishing a baseline for future agents.
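
As a rough illustration of how headline numbers of this kind could be aggregated, the sketch below computes a true positive detection rate, a false positive rejection rate, and a mean count of novel findings from per-case records. The `CaseResult` record, its field names, and the choice to average novel findings over all cases are assumptions, not the paper's schema or procedure; the sample data is invented solely to exercise the function.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseResult:
    # Hypothetical per-case record; these fields are illustrative assumptions.
    is_true_incident: bool  # expert-validated ground truth label
    agent_flagged: bool     # agent's triage decision (True = real incident)
    novel_findings: int     # findings credited by the judge beyond the alert


def summarize(results):
    """Aggregate M1-style triage rates and an M2-style mean of novel findings."""
    positives = [r for r in results if r.is_true_incident]
    negatives = [r for r in results if not r.is_true_incident]
    # TP detection: share of real incidents the agent flagged.
    tp_rate = (
        sum(r.agent_flagged for r in positives) / len(positives)
        if positives else None
    )
    # FP rejection: share of benign alerts the agent correctly dismissed.
    fp_rejection = (
        sum(not r.agent_flagged for r in negatives) / len(negatives)
        if negatives else None
    )
    # Assumption: novel findings are averaged over every case.
    avg_novel = mean(r.novel_findings for r in results) if results else None
    return {
        "tp_detection": tp_rate,
        "fp_rejection": fp_rejection,
        "avg_novel_findings": avg_novel,
    }


# Invented sample data, not taken from the benchmark.
sample = [
    CaseResult(is_true_incident=True, agent_flagged=True, novel_findings=6),
    CaseResult(is_true_incident=True, agent_flagged=False, novel_findings=2),
    CaseResult(is_true_incident=False, agent_flagged=False, novel_findings=4),
    CaseResult(is_true_incident=False, agent_flagged=True, novel_findings=0),
]
summary = summarize(sample)
print(summary)
```

Reporting TP detection and FP rejection separately, as the summary does, keeps a high detection rate from hiding a tendency to escalate benign alerts.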

Abstract

We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expert-validated ground truth, SIR-Bench measures not only whether agents reach correct triage decisions, but whether they discover novel evidence through active investigation. To construct SIR-Bench, we develop Once Upon A Threat (OUAT), a framework that replays real incident patterns in controlled cloud environments, producing authentic telemetry with measurable investigation outcomes. Our evaluation methodology introduces three complementary metrics: triage accuracy (M1), novel finding discovery (M2), and tool usage appropriateness (M3), assessed through an adversarial LLM-as-Judge that inverts the burden of proof -- requiring concrete forensic evidence to credit investigations. Evaluating our SIR agent on the benchmark demonstrates 97.1% true positive (TP) detection, 73.4% false positive (FP) rejection, and 5.67 novel key findings per case, establishing a baseline against which future investigation agents can be measured.
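
The inverted burden of proof can be pictured with a toy, rule-based stand-in for the judge: a finding earns credit only if it cites evidence, and that evidence is not already present in the triggering alert. The paper's judge is an LLM, so this is a deliberately simplified sketch rather than the authors' method; `Finding`, `credit_findings`, the string-matching rule, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    # Hypothetical structure for one claim made by an investigation agent.
    claim: str
    evidence: list = field(default_factory=list)  # e.g. log lines or artifact IDs


def credit_findings(alert_facts, findings):
    """Return findings that cite at least one evidence item absent from the alert.

    The default outcome is "no credit": claims without evidence, or whose
    evidence only restates the alert, are rejected.
    """
    known = {fact.strip().lower() for fact in alert_facts}
    credited = []
    for finding in findings:
        new_evidence = [
            item for item in finding.evidence
            if item.strip().lower() not in known
        ]
        if new_evidence:
            credited.append(finding)
    return credited


# Invented example: one parroted claim, one evidenced claim, one bare assertion.
alert_facts = ["suspicious login from 203.0.113.7"]
findings = [
    Finding("Suspicious login occurred", ["Suspicious login from 203.0.113.7"]),
    Finding("A new access key was created",
            ["audit log: new access key created for user alice"]),
    Finding("Data was exfiltrated"),
]
credited = credit_findings(alert_facts, findings)
print([f.claim for f in credited])
```

Under this rule, restating the alert and asserting a conclusion without evidence both score zero, which is the behavior the "alert parroting" distinction is meant to capture.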