Towards Optimal Agentic Architectures for Offensive Security Tasks

arXiv cs.AI / 4/22/2026

Key Points

The paper studies how to choose agent coordination topologies for LLM-based security agents, addressing whether adding more agents improves results or merely increases cost.
It introduces a controlled benchmark with 20 interactive targets (web/API and binary), testing vulnerability detection in both whitebox and blackbox settings.
Across 600 core runs covering five architecture families, three model families, and two access modes, the study reports best validated detection with MAS-Indep at 64.2% and best efficiency with SAS at $0.058 per validated finding.
Results show strong performance gaps by observability and domain: whitebox greatly outperforms blackbox (67.0% vs. 32.7% validated detection) and web greatly outperforms binary (74.3% vs. 25.3%).
The findings suggest a non-monotonic cost–quality frontier, where broader coordination can raise coverage but does not always dominate after factoring in latency, token costs, and exploit-validation difficulty.
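One way to read the cost-quality frontier in the last point is as a Pareto front over (validated detection rate, cost per validated finding). The sketch below is illustrative only: the architecture names `arch_a` to `arch_d` and all numbers are hypothetical placeholders, not results from the paper, and `pareto_frontier` is an assumed helper rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchResult:
    name: str                 # hypothetical architecture label
    validated_rate: float     # fraction of runs with a validated finding
    cost_per_finding: float   # USD per validated finding


def pareto_frontier(results):
    """Keep results that no other result beats on both rate and cost."""
    frontier = []
    for r in results:
        dominated = any(
            o.validated_rate >= r.validated_rate
            and o.cost_per_finding <= r.cost_per_finding
            and (
                o.validated_rate > r.validated_rate
                or o.cost_per_finding < r.cost_per_finding
            )
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Sort cheapest first so the frontier reads left to right on a cost axis.
    return sorted(frontier, key=lambda r: r.cost_per_finding)


# Placeholder values, chosen only to illustrate a dominated point.
results = [
    ArchResult("arch_a", 0.40, 0.05),
    ArchResult("arch_b", 0.55, 0.12),
    ArchResult("arch_c", 0.50, 0.20),  # dominated by arch_b
    ArchResult("arch_d", 0.65, 0.30),
]

for r in pareto_frontier(results):
    print(f"{r.name}: rate={r.validated_rate:.2f}, cost=${r.cost_per_finding:.2f}")
```

In this toy data `arch_c` costs more than `arch_b` while detecting less, so it drops off the frontier. That is the sense in which a broader topology can add cost without dominating a cheaper one.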

Abstract

Agentic security systems increasingly audit live targets with tool-using LLMs, but prior systems fix a single coordination topology, leaving unclear when additional agents help and when they only add cost. We treat topology choice as an empirical systems question. We introduce a controlled benchmark of 20 interactive targets (10 web/API and 10 binary), each exposing one endpoint-reachable ground-truth vulnerability, evaluated in whitebox and blackbox modes. The core study executes 600 runs over five architecture families, three model families, and both access modes, with a separate 60-run long-context pilot reported only in the appendix. On the completed core benchmark, detection-any reaches 58.0% and validated detection reaches 49.8%. MAS-Indep attains the highest validated detection rate (64.2%), while SAS is the strongest efficiency baseline at $0.058 per validated finding. Whitebox materially outperforms blackbox (67.0% vs. 32.7% validated detection), and web materially outperforms binary (74.3% vs. 25.3%). Bootstrap confidence intervals and paired target-level deltas show that the dominant effects are observability and domain, while some leading whitebox topologies remain statistically close. The main result is a non-monotonic cost-quality frontier: broader coordination can improve coverage, but it does not dominate once latency, token cost, and exploit-validation difficulty are taken into account.
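The abstract mentions bootstrap confidence intervals and paired target-level deltas without giving the procedure. A minimal sketch of one common approach, a percentile bootstrap over per-target differences, is shown below. The per-target rates are synthetic, generated only so the example runs; the function names, the 2,000 resamples, and the 95% level are assumptions, not details from the paper.

```python
import random
from statistics import mean


def paired_deltas(a, b):
    """Per-target difference a[i] - b[i]; both lists must cover the same targets."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    return [x - y for x, y in zip(a, b)]


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample targets with replacement and record the mean of each resample.
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]


# Synthetic per-target validated-detection rates for 20 targets.
# Each rate is k/15, assuming 5 architectures x 3 model families per mode.
data_rng = random.Random(42)
whitebox = [data_rng.randint(5, 15) / 15 for _ in range(20)]
blackbox = [data_rng.randint(0, 10) / 15 for _ in range(20)]

deltas = paired_deltas(whitebox, blackbox)
lo, hi = bootstrap_ci(deltas)
print(f"mean delta = {mean(deltas):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling at the target level, rather than at the level of individual runs, keeps runs against the same target together, which matches the "paired target-level" framing. An interval that excludes zero would indicate a difference that is unlikely to come from target sampling alone.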
