Causal Foundations of Collective Agency

arXiv cs.AI / 5/4/2026


Key Points

  • The paper addresses a key AI safety risk: multiple simpler agents could unintentionally coordinate into a collective agent with its own distinct goals and capabilities.
  • It proposes a behavioral criterion for when a group should be treated as a unified collective agent—specifically, when the group’s joint actions are well-predicted as rational, goal-directed behavior.
  • The authors formalize collective agency using causal games (causal models of strategic multi-agent interactions) and causal abstraction (conditions under which a simplified model faithfully represents a more complex one).
  • They apply the framework to resolve a multi-agent incentives puzzle in actor-critic models and to quantitatively measure how much collective agency is induced by different voting mechanisms.
  • The work is intended as a foundation for future theoretical and empirical efforts to understand, predict, and control emergent collective agents in multi-agent AI systems.
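The behavioral criterion in the second key point can be illustrated with a toy sketch. This is not the paper's causal-games formalism; it is a minimal, hypothetical example of the underlying idea: treat a group as a collective agent when a single rational-agent model (here, "the group argmaxes one utility function") predicts its observed joint actions. All utilities and observations below are made up for illustration.

```python
import itertools

def predict_as_collective(group_utility, joint_actions):
    """Predict the joint action a single rational collective agent would pick:
    the action maximizing one group-level utility function."""
    return max(joint_actions, key=group_utility)

# Hypothetical setup: two agents, each choosing action 0 or 1.
# Made-up private utilities over each agent's own action.
private = [{0: 1.0, 1: 0.0},
           {0: 0.8, 1: 0.2}]

# All possible joint actions of the two agents.
joint_actions = list(itertools.product([0, 1], repeat=2))

# Candidate collective-agent model: the group maximizes the sum of
# private utilities (one of many aggregate utilities one could test).
def group_utility(action):
    return sum(u[a] for u, a in zip(private, action))

observed = (0, 0)  # a (made-up) observed joint action
predicted = predict_as_collective(group_utility, joint_actions)

# If the collective-agent model's predictions match observed joint
# behavior across many interactions, the behavioral criterion ascribes
# collective agency to the group.
print(predicted == observed)
```

In the paper's framework this intuition is made precise with causal games and causal abstraction, rather than a single argmax check; the sketch only conveys the prediction-based stance.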

Abstract

A key challenge for the safety of advanced AI systems is the possibility that multiple simpler agents might inadvertently form a collective agent with capabilities and goals distinct from those of any individual. More generally, determining when a group of agents can be viewed as a unified collective agent is a foundational question in the study of interactions and incentives in both biological and artificial systems. We adopt a behavioral perspective in answering this question, ascribing collective agency to a group when viewing the group's joint actions as rational and goal-directed successfully predicts its behavior. We formalize this perspective on collective agency using causal games -- which are causal models of strategic, multi-agent interactions -- and causal abstraction -- which formalizes when a simple, high-level model faithfully captures a more complex, low-level model. We use this framework to solve a puzzle regarding multi-agent incentives in actor-critic models and to make quantitative assessments of the degree of collective agency exhibited by different voting mechanisms. Our framework aims to provide a foundation for theoretical and empirical work to understand, predict, and control emergent collective agents in multi-agent AI systems.