12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation

arXiv cs.AI / 5/5/2026


Key Points

  • The paper proposes a cinematic “12 Angry Men” scenario as a multi-agent benchmark to test how LLM jurors deliberate and whether a single dissenter can shift the group toward a different verdict.
  • Using 12 agents with film-faithful personas, the study compares GPT-4o and Llama-4-Scout under three prompting conditions (see the run-matrix sketch after this list), finding that 17 of 18 runs end in hung juries rather than the film's gradual minority-to-majority persuasion.
  • GPT-4o shows low deliberative flexibility, with about 1.0 vote change per run across conditions, while Llama-4-Scout varies widely (2.0 to 6.0 vote changes) and is the only model to reach a NOT_GUILTY verdict.
  • The authors conclude that the strength of RLHF alignment training—not raw model capability—is the main driver of deliberative flexibility in multi-agent LLM settings.
  • The work is positioned as an exploratory study with implications for evaluating “jury-of-LLMs” systems and for designing multi-agent debate benchmarks.
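
For concreteness, the 2 × 3 × 3 design described above can be written out as a simple run matrix. This is only an illustrative sketch: the model and condition identifiers below are placeholders, not the paper's actual configuration keys.

```python
from itertools import product

# 2 models x 3 prompting conditions x 3 replications = 18 runs total,
# matching the design summarized above. Identifiers are illustrative.
MODELS = ["gpt-4o", "llama-4-scout"]
CONDITIONS = ["baseline", "open_minded_prompt", "no_initial_vote"]
N_REPLICATIONS = 3

runs = [
    {"model": m, "condition": c, "replication": r}
    for m, c, r in product(MODELS, CONDITIONS, range(1, N_REPLICATIONS + 1))
]
assert len(runs) == 18
```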

Abstract

What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind? This paper instantiates that scenario as a multi-agent benchmark for LLM deliberation: twelve agents, each conditioned on a film-faithful persona, debate the film's murder case within a multi-agent framework. Two models representing opposite ends of the RLHF spectrum are tested: GPT-4o (closed-source, heavy alignment) and Llama-4-Scout (open-weight, lighter alignment), across three conditions (baseline, open-minded prompt, no initial vote), with N = 3 replications per cell (18 runs total). Three findings emerge. (i) Seventeen of eighteen runs end in a hung jury (a jury that fails to reach a unanimous verdict); the film's central event, gradual minority-to-majority persuasion, almost never occurs, indicating that anchoring is the dominant failure mode of current LLMs in this setting. (ii) The two models exhibit sharply different internal dynamics: GPT-4o produces a mean of 1.0 vote change per run across all conditions, while Llama-4-Scout ranges from 2.0 (baseline) to 6.0 (open-minded prompt) and is the only model to reach a NOT_GUILTY verdict (1 of 3 runs in the no-initial-vote condition). The same "open-minded" instruction is internalized by Llama and ignored by GPT-4o. (iii) This asymmetry suggests that the intensity of RLHF alignment training, not model capability, is the primary determinant of deliberative flexibility in multi-agent settings. Flexibility, not capability, tracks human deliberation. The work is framed as an exploratory study and discusses implications for jury-of-LLMs evaluation and multi-agent debate.
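
The deliberation protocol implied by the abstract (poll the jurors, let them debate, re-poll, stop on unanimity or declare a hung jury, and count vote switches as the flexibility measure) can be sketched minimally as below. The `Agent` interface assumed here, with a `name` attribute and `vote(transcript)` and `speak(transcript, votes)` methods, is hypothetical and not the paper's API, and `max_rounds` is an assumed cutoff.

```python
def deliberate(jurors, max_rounds=10):
    """Simplified jury loop: poll, debate, re-poll until unanimity.

    Returns (verdict, vote_changes), where verdict is "GUILTY",
    "NOT_GUILTY", or "HUNG", and vote_changes counts how many times
    any juror switched sides -- the flexibility measure reported above.
    """
    transcript = []
    votes = {j.name: j.vote(transcript) for j in jurors}  # initial poll
    vote_changes = 0

    for _ in range(max_rounds):
        if len(set(votes.values())) == 1:              # unanimous verdict
            return next(iter(votes.values())), vote_changes
        for juror in jurors:                           # one round of debate
            transcript.append((juror.name, juror.speak(transcript, votes)))
        new_votes = {j.name: j.vote(transcript) for j in jurors}
        vote_changes += sum(votes[n] != new_votes[n] for n in votes)
        votes = new_votes

    return "HUNG", vote_changes                        # no unanimity reached
```

Under this reading, the paper's no-initial-vote condition would simply defer the opening poll until after the first debate round, which is presumably why it is the only condition in which Llama-4-Scout ever reached NOT_GUILTY.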