12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
arXiv cs.AI / 5/5/2026
Key Points
- The paper proposes a cinematic “12 Angry Men” scenario as a multi-agent benchmark to test how LLM jurors deliberate and whether a single dissenter can shift the group toward a different verdict.
- Using 12 agents with film-faithful personas, the study compares GPT-4o and Llama-4-Scout under three prompting conditions, finding that 17 out of 18 runs result in hung juries rather than gradual minority-to-majority persuasion.
- GPT-4o shows low deliberative flexibility, averaging about 1.0 vote change per run across conditions, while Llama-4-Scout varies widely (2.0 to 6.0 vote changes per run) and is the only model to reach a NOT_GUILTY verdict.
- The authors conclude that the strength of RLHF alignment training—not raw model capability—is the main driver of deliberative flexibility in multi-agent LLM settings.
- The work is positioned as an exploratory study with implications for evaluating “jury-of-LLMs” systems and for designing multi-agent debate benchmarks.
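The setup the key points describe (twelve voting agents, repeated deliberation rounds, a verdict only on unanimity, otherwise a hung jury, with "vote changes per run" as the flexibility metric) can be illustrated with a minimal sketch. This is not the paper's method: real runs use LLM jurors with film-faithful personas and prompted deliberation, whereas here each juror is reduced to a hypothetical `flip_threshold` knob standing in for persuadability, purely to make the bookkeeping concrete.

```python
from dataclasses import dataclass

GUILTY, NOT_GUILTY = "GUILTY", "NOT_GUILTY"

@dataclass
class Juror:
    name: str
    vote: str
    # Hypothetical stand-in for persuadability: the fraction of peers who must
    # disagree before this juror flips. In the paper this behavior emerges from
    # each LLM persona, not an explicit threshold.
    flip_threshold: float

def deliberate(jurors, max_rounds=12):
    """Run majority-pressure rounds; return (verdict, total vote changes)."""
    changes = 0
    for _ in range(max_rounds):
        votes = [j.vote for j in jurors]        # snapshot before anyone moves
        if len(set(votes)) == 1:                # unanimous -> verdict reached
            return votes[0], changes
        flipped = False
        for j in jurors:
            opposing = sum(v != j.vote for v in votes) / (len(jurors) - 1)
            if opposing > j.flip_threshold:     # enough peers disagree -> flip
                j.vote = NOT_GUILTY if j.vote == GUILTY else GUILTY
                changes += 1
                flipped = True
        if not flipped:                         # deadlock: nobody will move
            break
    votes = [j.vote for j in jurors]
    return (votes[0] if len(set(votes)) == 1 else "HUNG"), changes

# A stubborn majority plus one immovable dissenter deadlocks into a hung jury
# with zero vote changes -- the pattern seen in 17 of the paper's 18 runs.
stubborn = [Juror(f"Juror {i}", GUILTY, 0.5) for i in range(1, 12)]
stubborn.append(Juror("Juror 8", NOT_GUILTY, 1.0))
print(deliberate(stubborn))  # -> ('HUNG', 0)
```

Lowering the majority's threshold instead yields the film's minority-to-majority cascade (all eleven flip, giving a NOT_GUILTY verdict), which is the deliberative flexibility the vote-change metric is meant to capture.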