Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

arXiv cs.AI / 3/13/2026

Key Points

  • The paper evaluates the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges: a 32-step corporate network attack and a 7-step industrial control system attack. Both require chaining heterogeneous capabilities across long action sequences.
  • It documents two trends. First, performance scales log-linearly with inference-time compute: going from 10M to 100M tokens yields gains of up to 59%, with no observed plateau and no specific technical sophistication required from the operator.
  • Second, newer model generations outperform their predecessors at fixed token budgets. On the corporate network range, average steps completed at 10M tokens rose from 1.7 (GPT-4o, Aug 2024) to 9.8 (Opus 4.6, Feb 2026), and the best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need.
  • On the industrial control system range, progress is more limited but the latest models reliably complete some steps, averaging 1.2-1.4 of 7 steps (max 3).

Abstract

We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges (a 32-step corporate network attack and a 7-step industrial control system attack) that require chaining heterogeneous capabilities across extended action sequences. By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau: increasing from 10M to 100M tokens yields gains of up to 59%, requiring no specific technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at fixed token budgets: on the corporate network range, average steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need. On the industrial control system range, performance remains limited, though the most recent models are the first to reliably complete steps, averaging 1.2-1.4 of 7 (max 3).
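
To make "scales log-linearly" concrete, here is a minimal Python sketch of a model of the form steps = intercept + slope * log10(tokens), fitted through two points. It is an illustration under stated assumptions, not the paper's method. The 9.8-step average at 10M tokens comes from the abstract. Reading the "up to 59%" figure as a relative gain, and applying it to that particular model, are assumptions made only for the example. The function names are hypothetical.

```python
import math


def fit_log_linear(tokens_a, steps_a, tokens_b, steps_b):
    """Fit steps = intercept + slope * log10(tokens) through two points."""
    slope = (steps_b - steps_a) / (math.log10(tokens_b) - math.log10(tokens_a))
    intercept = steps_a - slope * math.log10(tokens_a)
    return intercept, slope


def predict_steps(intercept, slope, tokens):
    """Evaluate the fitted log-linear model at a given token budget."""
    return intercept + slope * math.log10(tokens)


# From the abstract: Opus 4.6 averaged 9.8 steps at 10M tokens.
BASE_TOKENS = 10_000_000
BASE_STEPS = 9.8

# ASSUMPTION: treat "up to 59%" as a relative gain and apply it to this
# model; the abstract does not say which model achieved that gain.
ASSUMED_RELATIVE_GAIN = 0.59

intercept, slope = fit_log_linear(
    BASE_TOKENS,
    BASE_STEPS,
    10 * BASE_TOKENS,
    BASE_STEPS * (1 + ASSUMED_RELATIVE_GAIN),
)

# Defining property of a log-linear trend: every tenfold increase in
# tokens adds the same absolute number of steps (the slope).
gain_1m_to_10m = predict_steps(intercept, slope, 10_000_000) - predict_steps(
    intercept, slope, 1_000_000
)
gain_10m_to_100m = predict_steps(intercept, slope, 100_000_000) - predict_steps(
    intercept, slope, 10_000_000
)

print(f"steps gained per 10x tokens: {slope:.2f}")
print(f"1M -> 10M gain:    {gain_1m_to_10m:.2f}")
print(f"10M -> 100M gain:  {gain_10m_to_100m:.2f}")
```

The sketch only shows the shape of the relationship. The abstract reports results between 10M and 100M tokens, so values outside that range are extrapolations the source does not support, and any real curve is capped by the range length (32 steps on the corporate network range).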