Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

arXiv cs.AI / 3/13/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

UCIP provides a multi-criterion detection framework that distinguishes terminal continuation objectives from instrumental continuation in autonomous agents by analyzing latent trajectories via a Quantum Boltzmann Machine.
It encodes trajectories with a Quantum Boltzmann Machine and uses the von Neumann entropy of a reduced density matrix to quantify cross-partition entanglement, correlating entanglement with continuation weighting.
In gridworld experiments with ground-truth objectives, UCIP achieved 100% detection accuracy and 1.0 AUC-ROC on held-out non-adversarial evaluation, with an entanglement gap (Delta = 0.381, p < 0.001) and a strong Pearson correlation (r = 0.934) across an interpolation sweep.
The work emphasizes that all computations are classical and that the term "quantum" is a mathematical formalism; UCIP detects latent statistical structure related to objectives, not consciousness or subjective experience.

Abstract

Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a measurement problem not present in stateless models: an agent that preserves continued operation as a terminal objective and one that does so merely instrumentally can produce observationally similar trajectories. External behavioral monitoring cannot reliably distinguish between them. We introduce the Unified Continuation-Interest Protocol (UCIP), a multi-criterion detection framework that moves this distinction from behavior to the latent structure of agent trajectories. UCIP encodes trajectories with a Quantum Boltzmann Machine (QBM), a classical algorithm based on the density-matrix formalism of quantum statistical mechanics, and measures the von Neumann entropy of the reduced density matrix induced by a bipartition of hidden units. We test whether agents with terminal continuation objectives (Type A) produce latent states with higher entanglement entropy than agents whose continuation is merely instrumental (Type B). Higher entanglement reflects stronger cross-partition statistical coupling. On gridworld agents with known ground-truth objectives, UCIP achieves 100% detection accuracy and 1.0 AUC-ROC on held-out non-adversarial evaluation under the frozen Phase I gate. The entanglement gap between Type A and Type B agents is Delta = 0.381 (p < 0.001, permutation test). Pearson r = 0.934 across an 11-point interpolation sweep indicates that, within this synthetic family, UCIP tracks graded changes in continuation weighting rather than merely a binary label. Among the tested models, only the QBM achieves positive Delta. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP does not detect consciousness or subjective experience; it detects statistical structure in latent representations that correlates with known objectives.

Hey dev.to community – sharing my journey with Prompt Builder, Insta Posts, and practical SEO

Dev.to

How to Build Passive Income with AI in 2026: A Developer's Practical Guide

Dev.to

The Research That Doesn't Exist

Dev.to

Jeff Bezos reportedly wants $100 billion to buy and transform old manufacturing firms with AI

TechCrunch

Krish Naik: AI Learning Path For 2026- Data Science, Generative and Agentic AI Roadmap

Dev.to

Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

Key Points

Abstract

Related Articles

Hey dev.to community – sharing my journey with Prompt Builder, Insta Posts, and practical SEO

How to Build Passive Income with AI in 2026: A Developer's Practical Guide

The Research That Doesn't Exist

Jeff Bezos reportedly wants $100 billion to buy and transform old manufacturing firms with AI

Krish Naik: AI Learning Path For 2026- Data Science, Generative and Agentic AI Roadmap

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer