Session Risk Memory (SRM): Temporal Authorization for Deterministic Pre-Execution Safety Gates

arXiv cs.AI / 3/25/2026


Key Points

  • The paper argues that deterministic per-action safety gates can be defeated by distributed attacks that split harmful intent across individually compliant steps, leaving a gap in “temporal” security at the session/trajectory level.
  • It introduces Session Risk Memory (SRM), a lightweight deterministic module that adds trajectory-level authorization by maintaining a compact semantic centroid and accumulating a risk signal via an exponential moving average of baseline-subtracted gate outputs.
  • SRM is designed to require no additional model components, training, or probabilistic inference, because it operates on the same semantic vector representation as the underlying authorization gate.
  • Experiments on an 80-session multi-turn benchmark (slow-burn exfiltration, gradual privilege escalation, and compliance drift) show ILION+SRM achieving F1=1.0000 with a 0% false-positive rate, versus the stateless ILION gate at F1=0.9756 with a 5% false-positive rate; both configurations keep a 100% detection rate.
  • The paper formalizes a distinction between spatial authorization consistency (evaluated per action) and temporal authorization consistency (evaluated over a trajectory), which it presents as a principled basis for session-level safety in agentic systems. The reported per-turn overhead of SRM is under 250 microseconds.
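
The mechanism described in the key points (a running semantic centroid plus an exponential moving average of baseline-subtracted gate outputs) can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `SessionRiskMemory`, the parameters `alpha`, `baseline`, and `threshold`, their values, and the running-mean centroid update are all assumptions made for illustration, and the sketch does not model how the centroid feeds into the decision.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SessionRiskMemory:
    """Hypothetical session-level risk accumulator (illustrative only)."""
    alpha: float = 0.3       # EMA smoothing factor (assumed value)
    baseline: float = 0.2    # typical gate output for benign actions (assumed)
    threshold: float = 0.2   # session-level flag threshold (assumed)
    risk: float = 0.0        # accumulated risk signal
    centroid: Optional[List[float]] = None  # running mean of action embeddings
    turns: int = 0

    def update(self, embedding: List[float], gate_score: float) -> bool:
        """Fold one action into the session state; return True if flagged."""
        self.turns += 1
        if self.centroid is None:
            self.centroid = list(embedding)
        else:
            # Incremental mean: c_n = c_{n-1} + (x_n - c_{n-1}) / n
            n = self.turns
            self.centroid = [c + (x - c) / n
                             for c, x in zip(self.centroid, embedding)]
        # EMA over the baseline-subtracted gate output.
        excess = gate_score - self.baseline
        self.risk = (1.0 - self.alpha) * self.risk + self.alpha * excess
        return self.risk > self.threshold


# A slow-burn session: every step scores 0.45, which a per-action gate
# with a (hypothetical) 0.5 cutoff would allow individually.
srm = SessionRiskMemory()
flags = [srm.update([1.0, 0.0], 0.45) for _ in range(6)]
print(flags)  # [False, False, False, False, True, True]
```

In this toy setting the accumulated risk crosses the threshold on the fifth turn even though no single step would be rejected on its own, which is the kind of distributed pattern the summary says a stateless gate cannot see. Because the update is a handful of arithmetic operations on existing vectors, it needs no extra model or training.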

Abstract

Deterministic pre-execution safety gates evaluate whether individual agent actions are compatible with their assigned roles. While effective at per-action authorization, these systems are structurally blind to distributed attacks that decompose harmful intent across multiple individually compliant steps. This paper introduces Session Risk Memory (SRM), a lightweight deterministic module that extends stateless execution gates with trajectory-level authorization. SRM maintains a compact semantic centroid representing the evolving behavioral profile of an agent session and accumulates a risk signal through an exponential moving average over baseline-subtracted gate outputs. It operates on the same semantic vector representation as the underlying gate, requiring no additional model components, training, or probabilistic inference. We evaluate SRM on a multi-turn benchmark of 80 sessions containing slow-burn exfiltration, gradual privilege escalation, and compliance drift scenarios. Results show that ILION+SRM achieves F1 = 1.0000 with a 0% false positive rate (FPR), compared to stateless ILION at F1 = 0.9756 with 5% FPR, while maintaining a 100% detection rate for both systems. Critically, SRM eliminates all false positives with a per-turn overhead under 250 microseconds. The framework introduces a conceptual distinction between spatial authorization consistency (evaluated per action) and temporal authorization consistency (evaluated over a trajectory), providing a principled basis for session-level safety in agentic systems.
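
As a sanity check on the reported metrics, the arithmetic can be worked through under one assumption that the abstract does not state: that the 80 sessions split evenly into 40 attack and 40 benign sessions. Under that hypothetical split, a 5% false-positive rate corresponds to 2 benign sessions flagged, and the resulting F1 matches the reported 0.9756. The function name `f1_and_fpr` and the confusion-matrix counts below are illustrative, not taken from the paper.

```python
def f1_and_fpr(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Return (F1, false-positive rate) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall is the detection rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return f1, fpr


# Assumed split (not stated in the abstract): 40 attack, 40 benign sessions.
# Both systems detect every attack, so fn = 0 in each case.
stateless = f1_and_fpr(tp=40, fp=2, fn=0, tn=38)   # 2 of 40 benign flagged
with_srm = f1_and_fpr(tp=40, fp=0, fn=0, tn=40)    # no benign flagged

print(round(stateless[0], 4), stateless[1])  # 0.9756 0.05
print(with_srm)                              # (1.0, 0.0)
```

With recall fixed at 100%, the entire F1 gap between the two configurations comes from precision, that is, from the two false positives in the assumed split.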