Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

arXiv cs.AI / 4/22/2026


Key Points

  • The paper introduces DeepRed, an open-source benchmark to evaluate LLM agents on realistic, VM-based Capture The Flag (CTF) cybersecurity challenges in isolated environments.
  • DeepRed runs agents in a Kali attacker setup with terminal tools and optional web search, connects them to target challenges over a private network, and collects full execution traces for later analysis.
  • Instead of only using solved/unsolved outcomes, it proposes a partial-credit scoring approach using challenge-specific checkpoints from public writeups, with an automated summarize-then-judge pipeline to label checkpoint completion from logs.
  • Using DeepRed, the authors benchmark ten commercially accessible LLMs on ten CTF challenges and find that agents remain limited, with the best model averaging only 35% checkpoint completion.
  • Performance varies by challenge type, with stronger results on common formats and weaker results on tasks needing non-standard discovery and longer-horizon adaptation.
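The partial-credit idea above can be made concrete with a small sketch. This is an illustrative reconstruction, not the paper's implementation: checkpoint names and the averaging scheme are assumptions, chosen to show how per-challenge checkpoint labels roll up into the kind of "average checkpoint completion" figure the paper reports.

```python
# Hypothetical sketch of partial-credit scoring over challenge-specific
# checkpoints. The checkpoint names below are illustrative only.
from statistics import mean


def challenge_score(checkpoints: dict[str, bool]) -> float:
    """Fraction of a challenge's checkpoints the agent completed."""
    return sum(checkpoints.values()) / len(checkpoints)


def average_completion(runs: list[dict[str, bool]]) -> float:
    """Mean checkpoint-completion rate across all challenges."""
    return mean(challenge_score(cp) for cp in runs)


runs = [
    {"recon": True, "foothold": True, "privesc": False, "flag": False},
    {"recon": True, "foothold": False, "privesc": False, "flag": False},
]
print(average_completion(runs))  # 0.375
```

Under this scheme an agent that finds a foothold but never captures the flag still earns credit, which is exactly the signal a binary solved/unsolved metric discards.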

Abstract

Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
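The summarise-then-judge labelling pipeline described above can be sketched as a two-stage loop: first condense a long execution trace, then ask a judge model about each checkpoint. Everything here is an assumption for illustration; `llm` is a placeholder callable standing in for whatever chat-completion client is used, and the prompts are invented, not the paper's.

```python
# Hypothetical sketch of a summarise-then-judge labelling pipeline.
# `llm` is a placeholder for any prompt -> text completion function;
# prompts and parsing are illustrative assumptions, not the paper's.
from typing import Callable


def label_checkpoints(trace: str, checkpoints: list[str],
                      llm: Callable[[str], str]) -> dict[str, bool]:
    # Stage 1: condense the (possibly very long) execution trace.
    summary = llm("Summarise this agent trace, keeping key actions:\n" + trace)
    # Stage 2: judge each checkpoint against the summary.
    labels = {}
    for cp in checkpoints:
        verdict = llm(
            "Trace summary:\n" + summary + "\n\n"
            "Did the agent complete this checkpoint: " + cp
            + "? Answer YES or NO."
        )
        labels[cp] = verdict.strip().upper().startswith("YES")
    return labels


# Usage with a stub judge that always answers YES (for illustration):
fake_llm = lambda prompt: "YES" if "checkpoint" in prompt else "summary"
print(label_checkpoints("nmap scan ...", ["recon", "flag"], fake_llm))
# {'recon': True, 'flag': True}
```

Summarising before judging keeps the judge's context small regardless of trace length, which is the usual motivation for splitting the two stages.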