The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes

arXiv cs.AI / 3/24/2026


Key Points

  • The paper addresses “intelligent disobedience” in shared autonomy, where an assistive AI may need to override a human instruction to prevent harm.
  • It proposes the Intelligent Disobedience Game (IDG), a sequential Stackelberg-style framework that models human leadership under asymmetric information and derives optimal strategies for both agents.
  • The analysis identifies key strategic phenomena such as “safety traps,” where the system avoids harm indefinitely but may fail to accomplish the human’s intended goal.
  • The work translates IDG into a shared-control Multi-Agent Markov Decision Process, creating a compact computational testbed for training reinforcement learning agents to learn safe non-compliance.
  • The authors position IDG as both a theoretical foundation for agent development and an experimental foundation to study how humans perceive and trust disobedient AI.

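The leader-follower dynamic under asymmetric information can be illustrated with a toy model. The sketch below is our own minimal example, not the paper's formal IDG: a one-dimensional corridor where the human leader issues moves toward a goal, and the assistive follower, who alone can see a hazard state, overrides any command that would enter it.

```python
# Toy sketch (hypothetical, not the paper's formalism): a 1-D corridor.
# States 0..4; state 4 is the goal, state 0 is a hazard visible only to
# the follower (asymmetric information).

HAZARD, GOAL = 0, 4

def follower_policy(state, human_action):
    """Comply with the human's move unless it would enter the hazard."""
    proposed = state + human_action
    if proposed == HAZARD:
        return 0          # intelligent disobedience: refuse, stay put
    return human_action   # comply with the leader's instruction

def rollout(start, human_actions):
    """Apply a sequence of human commands filtered by the follower."""
    state = start
    for a in human_actions:
        state += follower_policy(state, a)
    return state
```

Here a human who mistakenly steps toward the hazard from state 1 is blocked, yet the goal remains reachable once the remaining commands are safe: `rollout(1, [-1, +1, +1, +1])` ends at the goal state 4.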
Abstract

In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as "safety traps," where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared-control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.
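The "safety trap" phenomenon described above can be made concrete with a self-contained toy example (our own hypothetical construction, not the paper's model): a follower that blocks every unsafe command guarantees harm never occurs, but if the human persistently issues the unsafe command, the goal is never reached either.

```python
# Hedged illustration of a "safety trap" in a toy 1-D world.
# State 0 is a hazard; state 4 is the goal. The follower vetoes any
# move that would enter the hazard, otherwise applies the human's move.

HAZARD, GOAL = 0, 4

def safe_step(state, human_action):
    """One shared-control step: comply unless the move enters the hazard."""
    proposed = state + human_action
    return state if proposed == HAZARD else proposed

def trapped(start, human_action, horizon=100):
    """True if repeating the same command never reaches the goal."""
    state = start
    for _ in range(horizon):
        state = safe_step(state, human_action)
        if state == GOAL:
            return False
    return True
```

A stubborn human at state 1 who repeatedly commands a move toward the hazard is held in place forever, so `trapped(1, -1)` is true: the system is perfectly safe yet never accomplishes the intended task, while `trapped(1, +1)` is false since the goal is reached in three steps.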