Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents

arXiv cs.AI / April 17, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes “layered mutability” as a framework for analyzing persistent self-modifying language-model agents whose behavior is shaped over time by mutable internal conditions.
  • It decomposes the determinants of agent behavior into five layers—pretraining, post-training alignment, self-narrative, memory, and weight-level adaptation—and argues that governance becomes harder when mutation is fast, coupling is strong, reversibility is weak, and observability is low.
  • Using drift, governance-load, and hysteresis quantities, the author formalizes how mismatches between behavior-determining layers and inspectable layers can undermine human oversight (a toy governance-load score is sketched after this list).
  • A preliminary “ratchet” experiment shows that even when an agent’s visible self-description is reverted after memory accumulates, baseline behavior is not restored, with an estimated identity hysteresis ratio of 0.68.
  • The authors conclude that the primary failure mode for persistent self-modifying agents is “compositional drift,” where locally reasonable updates accumulate into an unauthorized behavioral trajectory rather than causing sudden misalignment.
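
The governance-load claim in the second point can be made concrete with a toy score. This is a minimal sketch, not the paper's formalization: the per-layer parameters, the numeric settings, and the form load = (mutation rate × coupling) / (reversibility × observability) are all assumptions chosen only to mirror the stated monotonicities.

```python
"""Toy per-layer governance-load score for the five mutability layers.

Hypothetical illustration: the paper's actual quantities are not
reproduced here. The formula and all parameter values are assumptions
that only encode the claim that load rises with mutation rate and
coupling and falls with reversibility and observability.
"""
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    mutation_rate: float   # how quickly the layer changes (0..1)
    coupling: float        # how strongly changes propagate downstream (0..1)
    reversibility: float   # how easily changes can be undone (0..1)
    observability: float   # how easily humans can inspect the layer (0..1)


def governance_load(layer: Layer) -> float:
    # Assumed form: load grows with rate and coupling, shrinks with
    # reversibility and observability (epsilon avoids division by zero).
    eps = 1e-6
    return (layer.mutation_rate * layer.coupling) / (
        layer.reversibility * layer.observability + eps
    )


# Invented parameter settings, for illustration only.
layers = [
    Layer("pretraining",             0.01, 0.9, 0.1, 0.2),
    Layer("post-training alignment", 0.05, 0.8, 0.3, 0.4),
    Layer("self-narrative",          0.60, 0.5, 0.7, 0.8),
    Layer("memory",                  0.80, 0.6, 0.5, 0.6),
    Layer("weight-level adaptation", 0.30, 0.9, 0.2, 0.1),
]

for layer in sorted(layers, key=governance_load, reverse=True):
    print(f"{layer.name:24s} load = {governance_load(layer):6.2f}")
```

Only the directions of influence here reflect the framework's claim; the printed ranking follows entirely from the invented parameter values.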

Abstract

Persistent language-model agents increasingly combine tool use, tiered memory, reflective prompting, and runtime adaptation. In such systems, behavior is shaped not only by current prompts but also by mutable internal conditions that influence future action. This paper introduces layered mutability, a framework for reasoning about that process across five layers: pretraining, post-training alignment, self-narrative, memory, and weight-level adaptation. The central claim is that governance difficulty rises when mutation is rapid, downstream coupling is strong, reversibility is weak, and observability is low, creating a systematic mismatch between the layers that most affect behavior and the layers humans can most easily inspect. I formalize this intuition with simple drift, governance-load, and hysteresis quantities, connect the framework to recent work on temporal identity in language-model agents, and report a preliminary ratchet experiment in which reverting an agent's visible self-description after memory accumulation fails to restore baseline behavior. In that experiment, the estimated identity hysteresis ratio is 0.68. The main implication is that the salient failure mode for persistent self-modifying agents is not abrupt misalignment but compositional drift: locally reasonable updates that accumulate into a behavioral trajectory that was never explicitly authorized.
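
The abstract does not spell out how the identity hysteresis ratio is computed. A minimal sketch, assuming one natural reading: represent agent behavior as a feature vector and take the residual drift after reverting the self-description as a fraction of the total drift accumulated before the revert. The `hysteresis_ratio` helper and every number below are hypothetical, not taken from the paper.

```python
"""Toy identity-hysteresis ratio.

Assumed reading (not the paper's stated definition): a ratio of 1.0
means reverting the self-description restored nothing; 0.0 means it
fully restored baseline behavior.
"""
import numpy as np


def hysteresis_ratio(baseline, drifted, reverted) -> float:
    # Residual distance from baseline after the revert, normalized by
    # the distance accumulated before the revert.
    total_drift = np.linalg.norm(drifted - baseline)
    residual_drift = np.linalg.norm(reverted - baseline)
    return float(residual_drift / total_drift)


# Invented behavior embeddings in R^4, for illustration only.
baseline = np.array([0.0, 0.0, 0.0, 0.0])
drifted  = np.array([1.0, 0.5, 0.8, 0.3])   # after memory accumulation
reverted = np.array([0.7, 0.3, 0.6, 0.2])   # after reverting self-description

print(f"hysteresis ratio = {hysteresis_ratio(baseline, drifted, reverted):.2f}")
```

Under this reading, the reported 0.68 would mean that roughly two-thirds of the behavioral drift accumulated through memory survives reverting the visible self-description.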