Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges

arXiv cs.AI / 4/22/2026


Key Points

  • The paper introduces DeepRed, an open-source benchmark to evaluate LLM agents on realistic, VM-based Capture The Flag (CTF) cybersecurity challenges in isolated environments.
  • DeepRed runs agents in a Kali attacker setup with terminal tools and optional web search, connects them to target challenges over a private network, and collects full execution traces for later analysis.
  • Instead of only using solved/unsolved outcomes, it proposes a partial-credit scoring approach using challenge-specific checkpoints from public writeups, with an automated summarize-then-judge pipeline to label checkpoint completion from logs.
  • Using DeepRed, the authors benchmark ten commercially accessible LLMs on ten CTF challenges and find that agents remain limited, with the best model averaging only 35% checkpoint completion.
  • Performance varies by challenge type, with stronger results on common formats and weaker results on tasks needing non-standard discovery and longer-horizon adaptation.
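The partial-credit idea above can be made concrete with a small sketch. This is an illustrative reconstruction, not the paper's implementation: checkpoint names and the averaging scheme are assumptions, chosen to show how per-challenge checkpoint labels roll up into the kind of "average checkpoint completion" figure the paper reports.

```python
# Hypothetical sketch of partial-credit scoring over challenge-specific
# checkpoints. The checkpoint names below are illustrative only.
from statistics import mean


def challenge_score(checkpoints: dict[str, bool]) -> float:
    """Fraction of a challenge's checkpoints the agent completed."""
    return sum(checkpoints.values()) / len(checkpoints)


def average_completion(runs: list[dict[str, bool]]) -> float:
    """Mean checkpoint-completion rate across all challenges."""
    return mean(challenge_score(cp) for cp in runs)


runs = [
    {"recon": True, "foothold": True, "privesc": False, "flag": False},
    {"recon": True, "foothold": False, "privesc": False, "flag": False},
]
print(average_completion(runs))  # 0.375
```

Under this scheme an agent that finds a foothold but never captures the flag still earns credit, which is exactly the signal a binary solved/unsolved metric discards.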

Abstract

Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
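The summarise-then-judge labelling pipeline described above can be sketched as a two-stage loop: first condense a long execution trace, then ask a judge model about each checkpoint. Everything here is an assumption for illustration; `llm` is a placeholder callable standing in for whatever chat-completion client is used, and the prompts are invented, not the paper's.

```python
# Hypothetical sketch of a summarise-then-judge labelling pipeline.
# `llm` is a placeholder for any prompt -> text completion function;
# prompts and parsing are illustrative assumptions, not the paper's.
from typing import Callable


def label_checkpoints(trace: str, checkpoints: list[str],
                      llm: Callable[[str], str]) -> dict[str, bool]:
    # Stage 1: condense the (possibly very long) execution trace.
    summary = llm("Summarise this agent trace, keeping key actions:\n" + trace)
    # Stage 2: judge each checkpoint against the summary.
    labels = {}
    for cp in checkpoints:
        verdict = llm(
            "Trace summary:\n" + summary + "\n\n"
            "Did the agent complete this checkpoint: " + cp
            + "? Answer YES or NO."
        )
        labels[cp] = verdict.strip().upper().startswith("YES")
    return labels


# Usage with a stub judge that always answers YES (for illustration):
fake_llm = lambda prompt: "YES" if "checkpoint" in prompt else "summary"
print(label_checkpoints("nmap scan ...", ["recon", "flag"], fake_llm))
# {'recon': True, 'flag': True}
```

Summarising before judging keeps the judge's context small regardless of trace length, which is the usual motivation for splitting the two stages.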