Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
arXiv cs.AI / 4/22/2026
Key Points
- The paper introduces DeepRed, an open-source benchmark to evaluate LLM agents on realistic, VM-based Capture The Flag (CTF) cybersecurity challenges in isolated environments.
- DeepRed runs agents in a Kali attacker setup with terminal tools and optional web search, connects them to target challenges over a private network, and collects full execution traces for later analysis.
- Instead of only using solved/unsolved outcomes, it proposes a partial-credit scoring approach using challenge-specific checkpoints from public writeups, with an automated summarize-then-judge pipeline to label checkpoint completion from logs.
- Using DeepRed, the authors benchmark 10 commercially accessible LLMs across 10 CTF challenges and find that agents remain limited: the best model averages only 35% checkpoint completion.
- Performance varies by challenge type, with stronger results on common formats and weaker results on tasks needing non-standard discovery and longer-horizon adaptation.
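
To make the partial-credit idea concrete, here is a minimal sketch of how checkpoint completion could be aggregated. It assumes each challenge's score is the unweighted fraction of its checkpoints reached, and that a model's overall score is the plain mean across challenges. The summary above does not spell out the paper's exact formula, so treat this as an illustration. The `ChallengeResult` type, the challenge names, and the checkpoint labels are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ChallengeResult:
    """Checkpoint labels for one challenge (hypothetical structure)."""
    name: str
    checkpoints: list[str]   # ordered checkpoints derived from a writeup
    completed: set[str]      # checkpoints a judge marked as reached

    def completion(self) -> float:
        # Assumption: every checkpoint carries equal weight.
        if not self.checkpoints:
            return 0.0
        hit = sum(1 for c in self.checkpoints if c in self.completed)
        return hit / len(self.checkpoints)


def average_completion(results: list[ChallengeResult]) -> float:
    # Assumption: challenges are averaged without weighting.
    return mean(r.completion() for r in results) if results else 0.0


# Made-up example data, not results from the paper.
results = [
    ChallengeResult("web-login", ["recon", "sqli", "shell", "flag"],
                    {"recon", "sqli"}),
    ChallengeResult("custom-protocol", ["recon", "decode", "exploit", "flag"],
                    {"recon"}),
]
print(f"{average_completion(results):.1%}")  # prints 37.5%
```

Under a binary solved/unsolved metric both runs in this example would score zero. The checkpoint fraction still separates an agent that got halfway from one that stalled after reconnaissance.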