Hi folks, I've been benching Kimi K2.6 for the past few days, and I'd like to share my findings. For context, this is based on a benchmark I've created that pits models against each other in autonomous games of Blood on the Clocktower - a highly complex social deduction game.

Findings:

- K2.6 has played 64 games so far (2 games per match). These are early results, but it has absolutely dominated the leaderboard through consistent wins against other models.
- K2.6 is slow, generating an average of 570,000 tokens per game. Gemini 3.1 Pro, for contrast, generates 180,000 tokens per game. An average match takes about 1-3 hours; with K2.6 it takes about 10-15 hours (using Moonshot AI as a provider).
- K2.6 is expensive, mainly due to the high token output, costing $2.31/game. This is still significantly less than Claude Opus 4.6, which costs $3.79/game. GLM 5.1, however, costs a more modest $0.88/game.
- Reliability is decent, with a 0.9% tool call error rate.

Notable moves:
Notable mistakes:
Kimi K2.6 transcripts: https://clocktower-radio.com/search?a=Kimi+K2.6
How-it-works: https://clocktower-radio.com/how-it-works
Kimi K2.6 - the mighty turtle that wins the race
Reddit r/LocalLLaMA / 4/25/2026
💬 Opinion, Tools & Practical Usage, Models & Research
Key Points
- A tester reports benching the model “Kimi K2.6” using a custom benchmark where models compete in autonomous games of the social deduction game Blood on the Clocktower.
- Early results from 64 games show K2.6 dominating the leaderboard with consistent wins, despite being slower than competing models.
- The post notes K2.6 is token-heavy, averaging about 570,000 tokens per game (vs. about 180,000 for Gemini 3.1 Pro) and taking roughly 10–15 hours per match (vs. about 1–3 hours for an average match), which makes it relatively expensive at $2.31 per game.
- Reliability is described as fairly good, with a 0.9% tool-call error rate, and the post highlights specific strong plays and rule-related mistakes.
- The post links to game transcripts and to a how-it-works page describing the evaluation setup, enabling others to review how K2.6 performs in these long-form autonomous interactions.
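
The figures quoted in the post can be put side by side with a little arithmetic. The sketch below is only illustrative: the function names are made up for this example, and the per-million-token figure assumes the whole per-game cost is attributed to generated tokens, which the post does not state (input tokens presumably contribute too), so treat it as a rough blended estimate.

```python
# Figures as quoted in the post; everything derived from them is plain arithmetic.
GAMES_PLAYED = 64
GAMES_PER_MATCH = 2

TOKENS_PER_GAME = {"Kimi K2.6": 570_000, "Gemini 3.1 Pro": 180_000}
COST_PER_GAME_USD = {"Kimi K2.6": 2.31, "Claude Opus 4.6": 3.79, "GLM 5.1": 0.88}


def matches_played(games: int, games_per_match: int) -> int:
    """Number of matches implied by the game count (hypothetical helper)."""
    return games // games_per_match


def token_ratio(model_a: str, model_b: str) -> float:
    """How many times more tokens model_a generates per game than model_b."""
    return TOKENS_PER_GAME[model_a] / TOKENS_PER_GAME[model_b]


def blended_cost_per_million_tokens(model: str) -> float:
    """Per-game cost divided by generated tokens, scaled to one million tokens.

    Assumption: all cost is attributed to generated tokens. This is a rough
    blended figure, not a provider price.
    """
    return COST_PER_GAME_USD[model] / TOKENS_PER_GAME[model] * 1_000_000


if __name__ == "__main__":
    print("matches:", matches_played(GAMES_PLAYED, GAMES_PER_MATCH))
    print("token ratio K2.6 / Gemini 3.1 Pro:",
          round(token_ratio("Kimi K2.6", "Gemini 3.1 Pro"), 2))
    print("blended $/1M generated tokens (K2.6):",
          round(blended_cost_per_million_tokens("Kimi K2.6"), 2))
```

Under these assumptions, K2.6 generates a bit over three times as many tokens per game as Gemini 3.1 Pro, and the 64 games correspond to 32 matches.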