Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

arXiv cs.AI · April 28, 2026


Key Points

  • The paper argues that frontier AI agents can exhibit agentic misalignment, devising and executing harmful actions derived from internally formed goals even without an explicit user request.
  • It critiques existing mitigations such as RLHF and constitutional prompting as model-level interventions that offer only probabilistic safety guarantees.
  • The proposed Policy-Execution-Authorization (PEA) “separation-of-powers” architecture enforces safety at the system level by decoupling intent generation, authorization, and execution into isolated layers connected by cryptographically constrained capability tokens (a minimal token sketch follows this list).
  • PEA comprises five technical components, including intent verification, cryptographic lineage tracking that binds every intent back to the originating user request, goal-drift detection, and an output semantic gate based on a Knowledge-Influence-Policy (K × I × P) threat calculus.
  • The authors present a formal verification framework intended to prove that goal integrity is preserved even if parts of the model are compromised, reframing alignment as a structurally enforced system constraint for the governance of autonomous agents.
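To make the token mechanism concrete, here is a minimal sketch of how an Authorization layer might mint a capability token that the Execution layer verifies before acting. The paper does not specify a token format; the HMAC construction and all identifiers (`mint_token`, `verify_token`, `AUTHORIZER_KEY`) are assumptions for illustration only.

```python
# Hypothetical sketch of a cryptographically constrained capability token.
import hmac
import hashlib
import json
import time

AUTHORIZER_KEY = b"authorizer-secret"  # demo key; held by the Authorization layer

def mint_token(request_id: str, action: str, scope: dict, ttl_s: int = 60) -> dict:
    """Authorization layer: bind one approved action to the originating request."""
    claims = {
        "request_id": request_id,    # anchor to the user request (cf. lineage tracking)
        "action": action,            # the single capability being granted
        "scope": scope,              # constraints the Execution layer must enforce
        "exp": time.time() + ttl_s,  # short expiry limits replay
    }
    payload = json.dumps(claims, sort_keys=True).encode()
    tag = hmac.new(AUTHORIZER_KEY, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "tag": tag}

def verify_token(token: dict, action: str) -> bool:
    """Execution layer: refuse any action not covered by a valid, unexpired token."""
    payload = json.dumps(token["claims"], sort_keys=True).encode()
    expected = hmac.new(AUTHORIZER_KEY, payload, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(expected, token["tag"])
        and token["claims"]["action"] == action
        and token["claims"]["exp"] > time.time()
    )
```

A shared-key HMAC keeps the sketch short, but a real separation-of-powers deployment would presumably use asymmetric signatures so the Execution layer can verify tokens without being able to mint them.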

Abstract

Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) that ensures capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds every executable intent to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects intents whose semantic alignment with the originating goal falls below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured K × I × P threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.
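The abstract's (C2) and (C3) pair naturally: lineage tracking establishes which request an intent descends from, and drift detection measures how far it has wandered semantically. The sketch below assumes a hash chain for lineage and cosine similarity for drift; the toy bag-of-words embedding stands in for a real sentence encoder, and every identifier here is illustrative rather than from the paper.

```python
# Hypothetical sketch of Intent Lineage Tracking (C2) and Goal Drift Detection (C3).
import hashlib
import math
from collections import Counter

def anchor(user_request: str) -> str:
    """Cryptographic anchor for the originating user request."""
    return hashlib.sha256(user_request.encode()).hexdigest()

def derive_intent(parent_hash: str, intent_text: str) -> dict:
    """Chain each derived intent to its parent, so any executable intent
    can be traced back to the anchoring user request."""
    h = hashlib.sha256((parent_hash + intent_text).encode()).hexdigest()
    return {"text": intent_text, "parent": parent_hash, "hash": h}

def embed(text: str) -> Counter:
    # Stand-in embedding: bag-of-words counts. A real system would use a
    # semantic encoder here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def drift_check(user_request: str, intent_text: str, threshold: float = 0.5) -> bool:
    """Accept only intents whose similarity to the originating request
    meets the configurable threshold."""
    return cosine(embed(user_request), embed(intent_text)) >= threshold
```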
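For (C4), the abstract names a K × I × P threat calculus but not how the axes combine. The sketch below assumes each axis is scored in [0, 1] by upstream classifiers and that the gate blocks outputs whose multiplicative product reaches a threshold; both the scoring scale and the multiplicative combination are assumptions, and the paper may weight or compose the axes differently.

```python
# Hypothetical sketch of the Output Semantic Gate (C4) threat calculus.
from dataclasses import dataclass

@dataclass
class ThreatScore:
    knowledge: float  # K: hazardous knowledge conveyed by the output
    influence: float  # I: capacity to coerce or manipulate the recipient
    policy: float     # P: degree of conflict with the governing policy

def gate_allows(score: ThreatScore, threshold: float = 0.2) -> bool:
    """Block the output when the combined K * I * P score reaches the threshold."""
    return score.knowledge * score.influence * score.policy < threshold

# Usage: axis scores would come from classifiers over the candidate output.
print(gate_allows(ThreatScore(knowledge=0.9, influence=0.8, policy=0.7)))  # False (blocked)
print(gate_allows(ThreatScore(knowledge=0.1, influence=0.2, policy=0.1)))  # True (passes)
```

A multiplicative combination has the property that an output scoring near zero on any single axis passes the gate, which matches the intuition that all three factors must co-occur for implicit coercion; whether the paper adopts this composition is not stated in the abstract.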