NASimJax: GPU-Accelerated Policy Learning Framework for Penetration Testing

arXiv cs.LG · March 23, 2026


Key Points

  • NASimJax is a GPU-accelerated, JAX-based reimplementation of NASim that achieves up to 100x higher environment throughput, enabling reinforcement learning training on larger network scenarios.
  • The work formulates automated penetration testing as a Contextual POMDP and introduces a network generation pipeline that yields structurally diverse and guaranteed-solvable scenarios to study zero-shot generalization.
  • It introduces a two-stage action decomposition (2SAS) to handle linearly growing action spaces and shows this approach substantially outperforms flat action masking at scale.
  • The paper analyzes interactions between Prioritized Level Replay and 2SAS, identifies a failure mode related to their credit-assignment dynamics, and demonstrates that NASimJax provides a fast, flexible platform for advancing RL-based penetration testing.
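
The scaling benefit of the two-stage action decomposition (2SAS) can be illustrated with a minimal, dependency-free sketch. This is an assumption-laden toy, not the paper's implementation: the primitive set, the `select_action` helper, and the uniform sampling are all illustrative. The point it shows is structural: a flat head must score every (host, primitive) pair, so its output grows as H × A, while a two-stage policy scores hosts and primitives separately, growing as H + A.

```python
import random

# Illustrative set of primitive attack actions (not NASim's actual action set).
PRIMITIVES = ["scan", "exploit", "privilege_escalation"]

def flat_action_space(num_hosts):
    """Flat masking: one head over every (host, primitive) pair -> H * A outputs."""
    return [(h, p) for h in range(num_hosts) for p in PRIMITIVES]

def two_stage_action_space(num_hosts):
    """2SAS-style decomposition: a host head (H outputs) plus a primitive head (A outputs)."""
    return list(range(num_hosts)), list(PRIMITIVES)

def select_action(num_hosts, rng):
    """Hypothetical two-stage selection: pick a target host, then a primitive for it.
    A learned policy would condition stage 2 on the stage-1 choice; we sample uniformly."""
    hosts, primitives = two_stage_action_space(num_hosts)
    host = rng.choice(hosts)            # stage 1: which host to target
    primitive = rng.choice(primitives)  # stage 2: what to do to it
    return host, primitive

# The flat head grows multiplicatively with network size, the decomposed heads additively.
for n in (10, 20, 40):
    flat = len(flat_action_space(n))
    staged = sum(len(head) for head in two_stage_action_space(n))
    print(f"{n} hosts: flat head = {flat}, two-stage heads = {staged}")
```

At 40 hosts the flat head already needs 120 outputs versus 43 for the decomposed heads, which is the gap the paper's scaling experiments probe.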

Abstract

Penetration testing, the practice of simulating cyberattacks to identify vulnerabilities, is a complex sequential decision-making task that is inherently partially observable and features large action spaces. Training reinforcement learning (RL) policies for this domain faces a fundamental bottleneck: existing simulators are too slow to train on realistic network scenarios at scale, resulting in policies that fail to generalize. We present NASimJax, a complete JAX-based reimplementation of the Network Attack Simulator (NASim), achieving up to 100x higher environment throughput than the original simulator. By running the entire training pipeline on hardware accelerators, NASimJax enables experimentation on larger networks under fixed compute budgets that were previously infeasible. We formulate automated penetration testing as a Contextual POMDP and introduce a network generation pipeline that produces structurally diverse and guaranteed-solvable scenarios. Together, these provide a principled basis for studying zero-shot policy generalization. We use the framework to investigate action-space scaling and generalization across networks of up to 40 hosts. We find that Prioritized Level Replay better handles dense training distributions than Domain Randomization, particularly at larger scales, and that training on sparser topologies yields an implicit curriculum that improves out-of-distribution generalization, even on topologies denser than those seen during training. To handle linearly growing action spaces, we propose a two-stage action decomposition (2SAS) that substantially outperforms flat action masking at scale. Finally, we identify a failure mode arising from the interaction between Prioritized Level Replay's episode-reset behaviour and 2SAS's credit assignment structure. NASimJax thus provides a fast, flexible, and realistic platform for advancing RL-based penetration testing.
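
The "guaranteed-solvable" property of the generation pipeline can be sketched with a stdlib-only toy. Everything here is an assumption for illustration: the paper's pipeline produces full NASim scenarios with services, subnets, and exploits, whereas this sketch reduces solvability to graph reachability and rejection-samples random topologies until the goal host is reachable from the attacker's entry host.

```python
import random
from collections import deque

def random_topology(num_hosts, edge_prob, rng):
    """Sample an undirected adjacency list over hosts 0..num_hosts-1."""
    adj = {h: set() for h in range(num_hosts)}
    for a in range(num_hosts):
        for b in range(a + 1, num_hosts):
            if rng.random() < edge_prob:
                adj[a].add(b)
                adj[b].add(a)
    return adj

def is_solvable(adj, start, goal):
    """Toy solvability check: the goal host is reachable from the entry host (BFS)."""
    seen, frontier = {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            return True
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def generate_scenario(num_hosts, edge_prob, rng, max_tries=1000):
    """Rejection-sample topologies until one is solvable, so every emitted
    scenario is guaranteed to have a path from entry (host 0) to goal."""
    for _ in range(max_tries):
        adj = random_topology(num_hosts, edge_prob, rng)
        if is_solvable(adj, start=0, goal=num_hosts - 1):
            return adj
    raise RuntimeError("no solvable topology found; raise edge_prob or max_tries")

rng = random.Random(42)
scenario = generate_scenario(num_hosts=40, edge_prob=0.1, rng=rng)
print(is_solvable(scenario, 0, 39))  # True by construction
```

Note that `edge_prob` doubles as a crude sparsity knob: lower values yield the sparser topologies that, per the abstract, act as an implicit curriculum for out-of-distribution generalization.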