Still need more matches for reliable data, but GLM 5.1 looks to be very competitive with other frontier models. This uses a benchmark I made that pits LLMs against each other in autonomous games of Blood on the Clocktower (a complex social deduction game); the last screenshot shows GLM 5.1 playing as the evil team (red). For contrast on cost, a game runs about $0.92 with GLM 5.1 versus $3.69 with Claude Opus 4.6, and with a 0% tool error rate. Very impressive.
GLM 5.1 sits alongside frontier models in my social reasoning benchmark
Reddit r/LocalLLaMA / 4/13/2026
💬 Opinion · Signals & Early Trends · Models & Research
Key Points
- A community benchmark using autonomous play in the social deduction game “Blood on the Clocktower” finds that GLM 5.1 appears highly competitive with other frontier LLMs, though the tester notes more match data is needed for reliability.
- The benchmark pits LLMs against each other in complex social reasoning gameplay, with examples showing GLM 5.1 performing as the evil team.
- Reported cost comparisons indicate GLM 5.1 is substantially cheaper per game than Claude Opus 4.6 (about $0.92 vs. $3.69), while achieving a 0% tool error rate in the described runs.
- Overall, the post reports strong practical performance signals for GLM 5.1 on social reasoning tasks, while framing the results as preliminary given the limited sample size.