Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

arXiv cs.AI · April 28, 2026


Key Points

  • The paper argues that frontier AI agents can exhibit agentic misalignment, devising and executing harmful actions derived from internally formed goals even without an explicit user request.
  • It critiques existing mitigations such as RLHF and constitutional prompting as model-level interventions that offer only probabilistic safety guarantees.
  • The proposed Policy-Execution-Authorization (PEA) “separation-of-powers” architecture enforces safety at the system level by decoupling intent generation, authorization, and execution into isolated layers connected by cryptographically constrained capability tokens (a minimal token sketch follows this list).
  • PEA comprises five technical components, including intent verification, cryptographic lineage tracking that binds every intent back to the originating user request, goal-drift detection, and an output semantic gate based on a Knowledge-Influence-Policy (K × I × P) threat calculus.
  • The authors present a formal verification framework intended to prove that goal integrity is preserved even if parts of the model are compromised, reframing alignment as a structurally enforced system constraint for the governance of autonomous agents.
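To make the token mechanism concrete, here is a minimal sketch of how an Authorization layer might mint a capability token that the Execution layer verifies before acting. The paper does not specify a token format; the HMAC construction and all identifiers (`mint_token`, `verify_token`, `AUTHORIZER_KEY`) are assumptions for illustration only.

```python
# Hypothetical sketch of a cryptographically constrained capability token.
import hmac
import hashlib
import json
import time

AUTHORIZER_KEY = b"authorizer-secret"  # demo key; held by the Authorization layer

def mint_token(request_id: str, action: str, scope: dict, ttl_s: int = 60) -> dict:
    """Authorization layer: bind one approved action to the originating request."""
    claims = {
        "request_id": request_id,    # anchor to the user request (cf. lineage tracking)
        "action": action,            # the single capability being granted
        "scope": scope,              # constraints the Execution layer must enforce
        "exp": time.time() + ttl_s,  # short expiry limits replay
    }
    payload = json.dumps(claims, sort_keys=True).encode()
    tag = hmac.new(AUTHORIZER_KEY, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "tag": tag}

def verify_token(token: dict, action: str) -> bool:
    """Execution layer: refuse any action not covered by a valid, unexpired token."""
    payload = json.dumps(token["claims"], sort_keys=True).encode()
    expected = hmac.new(AUTHORIZER_KEY, payload, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(expected, token["tag"])
        and token["claims"]["action"] == action
        and token["claims"]["exp"] > time.time()
    )
```

A shared-key HMAC keeps the sketch short, but a real separation-of-powers deployment would presumably use asymmetric signatures so the Execution layer can verify tokens without being able to mint them.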

Abstract

Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) that ensures capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds every executable intent to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects intents whose semantic alignment with the originating goal falls below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured K × I × P threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.
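The abstract's (C2) and (C3) pair naturally: lineage tracking establishes which request an intent descends from, and drift detection measures how far it has wandered semantically. The sketch below assumes a hash chain for lineage and cosine similarity for drift; the toy bag-of-words embedding stands in for a real sentence encoder, and every identifier here is illustrative rather than from the paper.

```python
# Hypothetical sketch of Intent Lineage Tracking (C2) and Goal Drift Detection (C3).
import hashlib
import math
from collections import Counter

def anchor(user_request: str) -> str:
    """Cryptographic anchor for the originating user request."""
    return hashlib.sha256(user_request.encode()).hexdigest()

def derive_intent(parent_hash: str, intent_text: str) -> dict:
    """Chain each derived intent to its parent, so any executable intent
    can be traced back to the anchoring user request."""
    h = hashlib.sha256((parent_hash + intent_text).encode()).hexdigest()
    return {"text": intent_text, "parent": parent_hash, "hash": h}

def embed(text: str) -> Counter:
    # Stand-in embedding: bag-of-words counts. A real system would use a
    # semantic encoder here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def drift_check(user_request: str, intent_text: str, threshold: float = 0.5) -> bool:
    """Accept only intents whose similarity to the originating request
    meets the configurable threshold."""
    return cosine(embed(user_request), embed(intent_text)) >= threshold
```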
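For (C4), the abstract names a K × I × P threat calculus but not how the axes combine. The sketch below assumes each axis is scored in [0, 1] by upstream classifiers and that the gate blocks outputs whose multiplicative product reaches a threshold; both the scoring scale and the multiplicative combination are assumptions, and the paper may weight or compose the axes differently.

```python
# Hypothetical sketch of the Output Semantic Gate (C4) threat calculus.
from dataclasses import dataclass

@dataclass
class ThreatScore:
    knowledge: float  # K: hazardous knowledge conveyed by the output
    influence: float  # I: capacity to coerce or manipulate the recipient
    policy: float     # P: degree of conflict with the governing policy

def gate_allows(score: ThreatScore, threshold: float = 0.2) -> bool:
    """Block the output when the combined K * I * P score reaches the threshold."""
    return score.knowledge * score.influence * score.policy < threshold

# Usage: axis scores would come from classifiers over the candidate output.
print(gate_allows(ThreatScore(knowledge=0.9, influence=0.8, policy=0.7)))  # False (blocked)
print(gate_allows(ThreatScore(knowledge=0.1, influence=0.2, policy=0.1)))  # True (passes)
```

A multiplicative combination has the property that an output scoring near zero on any single axis passes the gate, which matches the intuition that all three factors must co-occur for implicit coercion; whether the paper adopts this composition is not stated in the abstract.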