Ran K2.6 through a third-party coding benchmark: here's how the figures stand up

Reddit r/LocalLLaMA / 5/6/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author reports running the Akitaonrails coding benchmark (with a fixed Rails + RubyLLM + Docker setup) on K2.6 and says it scored 87, placing it in Tier A (80+).
  • In the same reproduced benchmark, K2.6 is reported to outperform Qwen 3.6 Plus (71), DeepSeek v4 Flash (78), and GLM 5.1, which is said to have dropped to Tier C.
  • The post emphasizes that Tier A vs Tier B reflects practical engineering behaviors like proper test mocking, error-path handling, multi-worker persistence, and typed errors—not just headline scores.
  • It also warns that many open-weight model “performance drops” may actually come from local tooling issues in 2026, such as llama.cpp bugs, missing tool-call parsers, and Ollama timeouts during long agent runs.
  • Overall, the author argues that achieving Tier A under a reproduced, methodology-fixed benchmark is a stronger claim than vendor-reported marketing results, but also notes there is still a gap at the very top (e.g., Opus 4.7 and GPT 5.4 reportedly tie at 97).

I have been following the akitaonrails coding benchmark, which tests against a fixed Rails + RubyLLM + Docker task rather than vendor-reported evals. The April 2026 update put K2.6 at 87, sitting in Tier A (80+), ahead of Qwen 3.6 Plus (71), DeepSeek v4 Flash (78), and GLM 5.1, which dropped to Tier C.
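For anyone who wants to poke at a similar setup, here is a minimal sketch of pointing a RubyLLM chat at a locally served model through an OpenAI-compatible endpoint. The endpoint URL, model id, and the `assume_model_exists` flag are my assumptions about a generic local config, not the benchmark's actual harness.

```ruby
require "ruby_llm"

# Sketch only: endpoint and model id are placeholders, not the
# benchmark's real configuration.
RubyLLM.configure do |config|
  # Point the OpenAI-compatible client at a local server
  # (e.g. Ollama exposes one at this path by default).
  config.openai_api_base = "http://localhost:11434/v1"
  config.openai_api_key  = "unused" # local servers generally ignore the key
end

# assume_model_exists skips RubyLLM's model registry check for
# models it doesn't know about (like a local K2.6 build).
chat = RubyLLM.chat(model: "k2.6", provider: :openai, assume_model_exists: true)
puts chat.ask("Generate a Rails migration that adds an index on users.email").content
```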

For context, Opus 4.7 and GPT 5.4 tie at 97, so there is still a real gap at the top... but K2.6 hitting Tier A on a reproduced, methodology-fixed benchmark is a different claim than vendor benchmark marketing.

What separates Tier A from Tier B in practice: proper test mocking, error-path handling, multi-worker persistence, typed errors. K2.6 passes most of these; most other open-weight models fail two or three of them silently. A sketch of what a couple of those behaviors look like in code is below.
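To make "typed errors" and "error-path handling" concrete, here is a small illustration of the kind of behavior those criteria seem to be distinguishing. All class and method names are invented for the example, not taken from the benchmark's tasks.

```ruby
# Tier B failure mode: rescue StandardError and swallow everything.
# Tier A behavior: typed errors, so callers can branch on the failure kind.
class ApiError < StandardError; end

class RateLimitError < ApiError
  attr_reader :retry_after

  def initialize(retry_after)
    @retry_after = retry_after
    super("rate limited, retry after #{retry_after}s")
  end
end

def fetch_with_retry(client, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    client.fetch
  rescue RateLimitError => e
    raise if attempts >= max_attempts # give up loudly, don't fail silently
    sleep(e.retry_after)              # error-path handling: back off, then retry
    retry
  end
end

# A stub standing in for a test mock: fails once, then succeeds.
class FlakyClient
  def initialize
    @calls = 0
  end

  def fetch
    @calls += 1
    raise RateLimitError, 0 if @calls == 1
    "ok"
  end
end

puts fetch_with_retry(FlakyClient.new) # => "ok"
```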

A practical note from the same benchmark: half the challenge of running open source locally in 2026 is the toolchain, not the model. llama.cpp bugs, missing tool-call parsers, Ollama timeouts killing long agent runs. Worth keeping in mind before attributing benchmark drops to the model itself.
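On the timeout point specifically, a common footgun is the HTTP client's default read timeout rather than anything Ollama-specific. A minimal sketch against Ollama's real /api/generate endpoint; the model name and the 600-second budget are placeholder assumptions:

```ruby
require "net/http"
require "json"

# Net::HTTP's default read_timeout is 60s; a long agent turn on a local
# model can blow past that, and the resulting timeout then looks like a
# model failure. Model name and timeout value here are placeholders.
uri  = URI("http://localhost:11434/api/generate")
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 600 # seconds; generous budget for long generations

req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
req.body = { model: "k2.6", prompt: "Summarize this diff...", stream: false }.to_json

res = http.request(req)
puts JSON.parse(res.body)["response"]
```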

submitted by /u/lucasbennett_1