Differences Between Kimi K2.5 and Kimi K2.6 on MineBench

Reddit r/LocalLLaMA / 4/22/2026

Opinion | Tools & Practical Usage | Models & Research

Key Points

  • The post compares Kimi K2.5 and Kimi K2.6 specifically on MineBench, a benchmark that evaluates a model’s ability to generate 3D Minecraft-like structures.
  • The author notes that Kimi’s results can be inconsistent: some builds show a high ceiling, while others are noticeably lower in quality than their counterparts.
  • The author concludes that K2.6's builds are all a major improvement over Kimi K2.5, but that their quality varies noticeably from one build to the next.
  • The total reported cost to run the benchmark was $2.35, and the author claims this makes Kimi the most cost-effective option for its achieved performance.
  • The post provides links to MineBench and the associated GitHub repository, along with references to earlier model-comparison posts.

Some Notes:

  • The one caveat is that I find Kimi's results quite inconsistent; the model clearly has a very high ceiling, but you'll see that some of its builds (in my opinion) fall short in quality compared to the others (though they're all a massive improvement over Kimi K2.5)
  • Total cost was $2.35
    • I think this is by far the most cost-effective model for its performance
    • If you enjoy these posts please feel free to help fund the benchmark

Benchmark: https://minebench.ai/
Git Repository: https://github.com/Ammaar-Alam/minebench

Previous Posts:

Extra Information (if you're confused):

Essentially, it's a benchmark that tests how well a model can create a 3D Minecraft-like structure.

The models are given a palette of blocks (think of them like LEGO bricks) and a prompt describing what to build; the first prompt you see in the post, for example, was a fighter jet. The models then had to build a fighter jet by returning JSON that gives the coordinates (x, y, z) of each block. It's interesting to see which model creates a better 3D representation of the given prompt.
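
To make the idea of a JSON build concrete, here is a minimal Python sketch of how a reply like that could be parsed and checked. The schema is an assumption for illustration only: the "blocks" key, the "x"/"y"/"z"/"type" fields, the example palette, and the 64-block grid size are all hypothetical, and the actual MineBench format may differ (see the repository for the real one).

```python
import json

# Hypothetical palette; the real MineBench palette is not given in this post.
PALETTE = {"stone", "glass", "iron_block"}


def parse_build(raw, palette, size=64):
    """Parse a model's JSON reply into a {(x, y, z): block_type} dict.

    Assumes a made-up schema: {"blocks": [{"x": 0, "y": 0, "z": 0, "type": "stone"}, ...]}.
    Raises ValueError for out-of-range coordinates or blocks outside the palette.
    """
    grid = {}
    for entry in json.loads(raw)["blocks"]:
        pos = (entry["x"], entry["y"], entry["z"])
        if not all(isinstance(c, int) and 0 <= c < size for c in pos):
            raise ValueError(f"coordinate out of bounds: {pos}")
        if entry["type"] not in palette:
            raise ValueError(f"block not in palette: {entry['type']}")
        # If a position is listed twice, the later entry wins.
        grid[pos] = entry["type"]
    return grid


reply = (
    '{"blocks": ['
    '{"x": 0, "y": 0, "z": 0, "type": "stone"}, '
    '{"x": 1, "y": 0, "z": 0, "type": "glass"}]}'
)
build = parse_build(reply, PALETTE)
print(len(build), "blocks placed")
```

A grid keyed by coordinates like this is the kind of structure a viewer could render block by block, and rejecting entries outside the palette or the grid is one simple way to keep a malformed reply from being treated as a valid build.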

The smarter models tend to design much more detailed and intricate builds. The repository README might help give a better understanding.

(Disclaimer: This is a public benchmark I created, so technically self-promotion :)

submitted by /u/ENT_Alam