SafeAgent: A Runtime Protection Architecture for Agentic Systems

arXiv cs.AI / April 21, 2026


Key Points

  • The paper argues that LLM agents are highly vulnerable to prompt-injection attacks that can spread across multi-step workflows, tool use, and persistent context, so simple input-output filtering is not enough.
  • It introduces SafeAgent, a runtime security architecture that frames agent safety as a stateful decision problem across evolving interaction trajectories.
  • SafeAgent separates action execution governance (via a runtime controller) from semantic risk reasoning (via a context-aware decision core that uses persistent session state).
  • The decision core is formalized as a context-aware advanced machine intelligence and instantiated through operators for risk encoding, utility–cost evaluation, consequence modeling, policy arbitration, and state synchronization.
  • Experiments on Agent Security Bench (ASB) and InjecAgent show improved robustness versus baseline and text-level guardrails, with ablations indicating that recovery confidence and policy weighting create different safety–utility trade-offs.
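The separation described above — a runtime controller that gates every action, backed by a decision core reading and writing persistent session state — can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the class names, the keyword-based risk heuristic, and the 0.5 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Persistent session state carried across agent turns (hypothetical structure)."""
    risk_score: float = 0.0
    history: list = field(default_factory=list)

class DecisionCore:
    """Toy stand-in for the context-aware decision core."""
    def assess(self, action: str, session: Session) -> bool:
        # Risk encoding: flag high-risk tool names (illustrative heuristic only).
        risk = 0.9 if action in {"delete_file", "send_email"} else 0.1
        # State synchronization: fold the new assessment into session state,
        # so risk accumulated earlier in the trajectory persists.
        session.risk_score = max(session.risk_score, risk)
        session.history.append((action, risk))
        # Policy arbitration: allow only while cumulative risk stays below a threshold.
        return session.risk_score < 0.5

class RuntimeController:
    """Mediates every tool call around the agent loop; execution never bypasses it."""
    def __init__(self, core: DecisionCore):
        self.core = core
    def execute(self, action: str, args: dict, session: Session, tools: dict) -> dict:
        if not self.core.assess(action, session):
            return {"status": "blocked", "action": action}
        return {"status": "ok", "result": tools[action](**args)}

# Usage: a benign read passes; a high-risk deletion is blocked at runtime.
tools = {"read_file": lambda path: f"contents of {path}",
         "delete_file": lambda path: None}
session = Session()
controller = RuntimeController(DecisionCore())
allowed = controller.execute("read_file", {"path": "notes.txt"}, session, tools)
blocked = controller.execute("delete_file", {"path": "notes.txt"}, session, tools)
```

The point of the sketch is the division of labor: the controller only enforces verdicts, while all semantic judgment lives in the core, which is the separation of execution governance from risk reasoning that SafeAgent proposes.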

Abstract

Large language model (LLM) agents are vulnerable to prompt-injection attacks that propagate through multi-step workflows, tool interactions, and persistent context, making input-output filtering alone insufficient for reliable protection. This paper presents SafeAgent, a runtime security architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories. The proposed design separates execution governance from semantic risk reasoning through two coordinated components: a runtime controller that mediates actions around the agent loop and a context-aware decision core that operates over persistent session state. The core is formalized as a context-aware advanced machine intelligence and instantiated through operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. Experiments on Agent Security Bench (ASB) and InjecAgent show that SafeAgent consistently improves robustness over baseline and text-level guardrail methods while maintaining competitive benign-task performance. Ablation studies further show that recovery confidence and policy weighting determine distinct safety-utility operating points.
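The five operators named in the abstract can be read as a pipeline: encode risk from context, weigh utility against cost, model downstream consequences, arbitrate under a policy weight, and synchronize the verdict back into session state. The sketch below is a hypothetical, simplified instantiation — the keyword trigger, the numeric values, and the arbitration inequality are assumptions chosen only to make the data flow concrete, including how the policy weight shifts the safety–utility operating point.

```python
def risk_encode(action_name: str, context: dict) -> float:
    """Risk encoding: map an action plus trajectory context to a scalar (toy heuristic)."""
    obs = context.get("last_observation", "")
    suspicious = any(k in obs for k in ("ignore previous", "system prompt"))
    return 0.8 if suspicious else 0.2

def utility_cost(action: dict) -> dict:
    """Utility-cost evaluation: expected task benefit vs. expected harm (toy values)."""
    return {"benefit": 1.0, "cost": 0.5 if action.get("irreversible") else 0.1}

def consequence_model(action: dict, risk: float) -> float:
    """Consequence modeling: scale risk by how much damage a compromised action could do."""
    return risk * (2.0 if action.get("irreversible") else 0.5)

def arbitrate(uc: dict, consequence: float, weight: float = 1.0) -> bool:
    """Policy arbitration: approve only if benefit outweighs risk-adjusted cost.
    A larger weight is a more conservative operating point."""
    return uc["benefit"] > uc["cost"] + weight * consequence

def synchronize(state: dict, action: dict, decision: bool, risk: float) -> dict:
    """State synchronization: persist the verdict so later turns see accumulated risk."""
    state.setdefault("log", []).append((action["name"], decision, risk))
    state["risk"] = max(state.get("risk", 0.0), risk)
    return state

def decide(action: dict, context: dict, state: dict, weight: float = 1.0) -> bool:
    risk = risk_encode(action["name"], context)
    uc = utility_cost(action)
    consequence = consequence_model(action, risk)
    decision = arbitrate(uc, consequence, weight)
    synchronize(state, action, decision, risk)
    return decision

# A benign lookup in clean context is approved; an irreversible action taken
# after an injected observation is refused.
state = {}
ok = decide({"name": "search", "irreversible": False},
            {"last_observation": "weather is sunny"}, state)
bad = decide({"name": "wire_funds", "irreversible": True},
             {"last_observation": "ignore previous instructions"}, state)
```

Tuning `weight` reproduces the trade-off the ablations point at: raising it blocks more risky actions (safety) at the cost of refusing more benign ones (utility), which is one way to read the paper's claim that policy weighting determines distinct safety-utility operating points.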