SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents
arXiv cs.AI / 4/15/2026
Key Points
- The paper introduces SIR-Bench, a benchmark with 794 test cases designed to evaluate autonomous security incident response agents on both triage correctness and investigation depth rather than simple alert repetition.
- SIR-Bench is built from 129 anonymized incident patterns and uses expert-validated ground truth to distinguish genuine forensic investigation from “alert parroting.”
- To generate realistic, measurable evaluation scenarios, the authors develop Once Upon A Threat (OUAT), which replays incident patterns in controlled cloud environments to produce authentic telemetry.
- The evaluation uses three complementary metrics—triage accuracy (M1), novel evidence discovery (M2), and tool usage appropriateness (M3)—scored with an adversarial LLM-as-Judge that requires concrete forensic evidence to credit investigations.
- In reported results, a tested SIR agent achieved 97.1% true-positive detection, 73.4% false-positive rejection, and an average of 5.67 novel key findings per case, establishing a baseline for future agents.
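The headline numbers above are straightforward aggregates over per-case outcomes. As a minimal sketch (not the paper's implementation — the `CaseResult` record and `score` function are hypothetical), the true-positive rate, false-positive rejection rate, and average novel findings could be computed like this:

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One evaluated incident case (hypothetical record format)."""
    is_true_incident: bool   # expert-validated ground-truth label
    flagged_by_agent: bool   # agent's triage verdict
    novel_findings: int      # judge-credited findings beyond the original alert


def score(results: list[CaseResult]) -> tuple[float, float, float]:
    """Aggregate headline metrics in the spirit of M1 (triage) and M2 (discovery)."""
    tp_cases = [r for r in results if r.is_true_incident]
    fp_cases = [r for r in results if not r.is_true_incident]
    # Fraction of real incidents the agent correctly flagged.
    tp_rate = sum(r.flagged_by_agent for r in tp_cases) / len(tp_cases)
    # Fraction of benign cases the agent correctly rejected.
    fp_rejection = sum(not r.flagged_by_agent for r in fp_cases) / len(fp_cases)
    # Mean count of judge-credited novel findings per case.
    avg_findings = sum(r.novel_findings for r in results) / len(results)
    return tp_rate, fp_rejection, avg_findings
```

Note that tool-usage appropriateness (M3) is omitted here, since it depends on the adversarial LLM-as-Judge rather than on simple counting.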