Towards Optimal Agentic Architectures for Offensive Security Tasks

arXiv cs.AI / 4/22/2026


Key Points

  • The paper studies how to choose agent coordination topologies for LLM-based security agents, addressing whether adding more agents improves results or merely increases cost.
  • It introduces a controlled benchmark of 20 interactive targets (10 web/API and 10 binary), each exposing one ground-truth vulnerability, and tests detection in both whitebox and blackbox settings.
  • Across 600 core runs covering five architecture families, three model families, and two access modes, the study reports best validated detection with MAS-Indep at 64.2% and best efficiency with SAS at $0.058 per validated finding.
  • Results show strong performance gaps by observability and domain: whitebox greatly outperforms blackbox (67.0% vs. 32.7% validated detection) and web greatly outperforms binary (74.3% vs. 25.3%).
  • The main result is a non-monotonic cost–quality frontier: broader coordination can raise coverage, but it does not dominate once latency, token cost, and exploit-validation difficulty are factored in.
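
The 600-run figure is consistent with a full factorial design: 5 architecture families × 3 model families × 2 access modes × 20 targets = 600. The sketch below enumerates such a grid. It assumes exactly one run per cell, which the summary does not state. Only "SAS" and "MAS-Indep" are named in the source; every other label (architectures, model families, target IDs) is a hypothetical placeholder.

```python
from itertools import product

# Only "SAS" and "MAS-Indep" appear in the source; the rest are placeholders.
ARCHITECTURES = ["SAS", "MAS-Indep", "arch-3", "arch-4", "arch-5"]
MODEL_FAMILIES = ["model-A", "model-B", "model-C"]  # hypothetical names
ACCESS_MODES = ["whitebox", "blackbox"]
# 10 web/API targets and 10 binary targets, with made-up identifiers.
TARGETS = [f"web-{i:02d}" for i in range(1, 11)] + [f"bin-{i:02d}" for i in range(1, 11)]


def build_run_grid():
    """Return one run spec per (architecture, model, mode, target) cell.

    Assumption: a single run per cell; the source only gives the total count.
    """
    return [
        {"architecture": arch, "model": model, "mode": mode, "target": target}
        for arch, model, mode, target in product(
            ARCHITECTURES, MODEL_FAMILIES, ACCESS_MODES, TARGETS
        )
    ]


runs = build_run_grid()
print(len(runs))  # 5 * 3 * 2 * 20 = 600
```

Under this assumption each architecture family accounts for 120 runs and each access mode for 300, so the per-mode rates (67.0% and 32.7%) average to roughly the reported overall validated detection rate.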

Abstract

Agentic security systems increasingly audit live targets with tool-using LLMs, but prior systems fix a single coordination topology, leaving unclear when additional agents help and when they only add cost. We treat topology choice as an empirical systems question. We introduce a controlled benchmark of 20 interactive targets (10 web/API and 10 binary), each exposing one endpoint-reachable ground-truth vulnerability, evaluated in whitebox and blackbox modes. The core study executes 600 runs over five architecture families, three model families, and both access modes, with a separate 60-run long-context pilot reported only in the appendix. On the completed core benchmark, detection-any reaches 58.0% and validated detection reaches 49.8%. MAS-Indep attains the highest validated detection rate (64.2%), while SAS is the strongest efficiency baseline at $0.058 per validated finding. Whitebox materially outperforms blackbox (67.0% vs. 32.7% validated detection), and web materially outperforms binary (74.3% vs. 25.3%). Bootstrap confidence intervals and paired target-level deltas show that the dominant effects are observability and domain, while some leading whitebox topologies remain statistically close. The main result is a non-monotonic cost-quality frontier: broader coordination can improve coverage, but it does not dominate once latency, token cost, and exploit-validation difficulty are taken into account.
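
The two headline metrics, validated detection rate with a bootstrap confidence interval and cost per validated finding, can be illustrated with a short Python sketch. This is not the authors' evaluation code: it uses a plain percentile bootstrap over per-run binary outcomes, and the function names, resample count, and data are all assumptions. The example outcomes are synthetic; 77 successes out of 120 runs were chosen only because that ratio rounds to 64.2%, not because the paper reports those counts.

```python
import random


def validated_rate(outcomes):
    """Fraction of runs whose finding was validated (outcomes are 0/1)."""
    return sum(outcomes) / len(outcomes)


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the validated detection rate.

    Assumption: runs are resampled independently with replacement; the
    paper's exact resampling scheme is not given in the abstract.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(round((alpha / 2) * n_resamples))
    hi_idx = int(round((1 - alpha / 2) * n_resamples)) - 1
    return rates[lo_idx], rates[hi_idx]


def cost_per_validated_finding(total_cost_usd, outcomes):
    """Total spend divided by the number of validated findings."""
    validated = sum(outcomes)
    if validated == 0:
        return float("inf")  # no validated findings: cost is unbounded
    return total_cost_usd / validated


# Synthetic data, not taken from the paper.
outcomes = [1] * 77 + [0] * 43
rate = validated_rate(outcomes)
ci_low, ci_high = bootstrap_ci(outcomes)
print(f"validated rate: {rate:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Comparing two architectures this way makes the abstract's caveat concrete: when their intervals overlap heavily, the topologies remain statistically close even if one has the higher point estimate.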