GLM 5.1 sits alongside frontier models in my social reasoning benchmark

Reddit r/LocalLLaMA / 4/13/2026


Key Points

  • A community benchmark using autonomous play in the social deduction game “Blood on the Clocktower” finds that GLM 5.1 appears highly competitive with other frontier LLMs, though the tester notes more match data is needed for reliability.
  • The benchmark pits LLMs against each other in complex social reasoning gameplay, with examples showing GLM 5.1 performing as the evil team.
  • Reported cost comparisons indicate GLM 5.1 is substantially cheaper per game than Claude Opus 4.6 (about $0.92 vs. $3.69), while achieving a 0% tool error rate in the described runs.
  • Overall, the post highlights strong practical performance signals for GLM 5.1 in social reasoning-style tasks, while framing the results as preliminary due to limited sample size.

Still need more matches for reliable data, but GLM 5.1 looks very competitive with other frontier models.

This uses a benchmark I made that pits LLMs against each other in autonomous games of Blood on the Clocktower (a complex social deduction game). The last screenshot shows GLM 5.1 playing as the evil team (red).

For contrast:
Claude Opus 4.6 costs $3.69 per game.
GLM 5.1 costs $0.92 per game.

With a 0% tool error rate.

Very impressive.
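Using the per-game figures from the post, the cost gap works out to roughly a 4x difference. A quick back-of-envelope check (the variable names are illustrative, not from the benchmark code):

```python
# Per-game cost figures as reported in the post (USD).
opus_cost_per_game = 3.69  # Claude Opus 4.6
glm_cost_per_game = 0.92   # GLM 5.1

# Ratio of the two per-game costs.
ratio = opus_cost_per_game / glm_cost_per_game
print(f"GLM 5.1 is about {ratio:.1f}x cheaper per game")  # ≈ 4.0x
```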

submitted by /u/cjami