SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
arXiv cs.AI / 4/20/2026
Key Points
- SocialGrid is a new embodied multi-agent benchmark for evaluating LLM agents on planning, task execution, and social reasoning in an environment inspired by Among Us.
- Experiments show that even the strongest open model tested (GPT-OSS-120B) achieves under 60% accuracy on task completion and planning, often getting stuck in repetitive behaviors or failing basic navigation.
- To prevent navigation/planning weaknesses from masking social-intelligence performance, SocialGrid includes an optional Planning Oracle that separates planning deficits from social reasoning evaluation.
- The results indicate that deception detection remains a major bottleneck: accuracy stays near random chance even as model size scales, suggesting models rely on shallow heuristics rather than accumulating evidence over the course of a game.
- SocialGrid also offers automatic failure analysis with fine-grained metrics and includes an Elo-based leaderboard from adversarial league play for ongoing comparison.
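The Elo-based leaderboard mentioned above presumably follows the standard Elo rating update. As a minimal sketch (the function names, K-factor, and initial ratings here are illustrative assumptions, not details from the paper), each match between two agents shifts their ratings toward the observed result:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one match.

    score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a draw.
    k controls how strongly a single result moves the ratings.
    """
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - ea))
    return new_a, new_b

# Two equally rated agents; A wins, gaining k/2 = 16 points.
a, b = elo_update(1500.0, 1500.0, score_a=1.0)  # → (1516.0, 1484.0)
```

In league play, repeating this update over many adversarial matches converges toward a stable ranking, which is what makes Elo attractive for an ongoing leaderboard.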