Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and more tested in coding

Reddit r/LocalLLaMA / 4/17/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A new round of coding-agent evaluations on the SanityHarness leaderboard tested Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7, and others, with results posted on sanityboard.lr7.dev.
  • The tester reports Opus 4.7 is a real improvement, while Kimi K2.6-Code-Preview shows less clear gains so far and is being held back for further hands-on comparison across more coding agents.
  • GLM 5.1 is described as strong among open-weight models, though the author places it below top closed models like Opus and GPT, and notes that “open-weight” competitors cluster around similar capability levels.
  • Minimax M2.7 performs well for its tier and for local execution/value, but model-size constraints limit head-to-head expectations versus higher-end options.
  • ForgeCode is highlighted as promising when it works and highest-scoring for Minimax M2.7 in this run, but the author flags significant bugs/UX-DX friction (including dependence on semantic-search services), and recommends waiting.
  • Kimi K2.6-Code-Preview is currently CLI-only with an anticipated API rollout next week; the author considers adding support to OpenCode by implementing Kimi CLI’s OpenAI-compatible format and Kimi-specific extensions, to enable broader testing.

Hi everyone. It's been a while since I posted (was a lil burned out), but some of you may have seen my older SanityHarness posts. I've got 145 results across the old and newer leaderboard now. I've tested Kimi K2.6-Code-Preview (thanks Moonshot for early access), Opus 4.7, GLM 5.1, Minimax M2.7 and others on my coding eval in this latest pass. Results are here: https://sanityboard.lr7.dev/

What's the lowdown?

Opus 4.7 is a genuine improvement, which is a surprise, since a lot of "new" model upgrades lately haven't really moved the needle. Kimi K2.6-code-preview doesn't seem that much better so far, but I'm withholding my opinion until I've had more hands-on time with it and gotten to test it in other coding agents.

GLM 5.1 seems pretty good. These open-weight models are all around the same level of capability, and still nowhere near Opus or GPT (I use a lot of both), despite what sensationalist takes from vibetubers might have you believe. At the upper tier you have stuff like Kimi K2.5 and GLM 5.1 (which I think might be close to Gemini or Sonnet levels), and in the middle tier you have stuff like Minimax M2.7 and Qwen 3.6 Plus. I still think those are great, especially for the price, or for being able to run locally (in the case of M2.7), but we are limited by model size here.

ForgeCode is interesting. It's genuinely very good when it works, and posted the highest score for Minimax M2.7. Would I ever use it? No. The UX/DX is very different from something like OpenCode, which is currently my favorite to use. This agent is a Zsh plugin, so users who like that kind of thing will appreciate ForgeCode more. I didn't get to test ForgeCode on anything else; at the time of testing it was broken with pretty much every other model/provider I tried. That's the other reason I find it hard to recommend right now: it's quite buggy. Probably best to wait a while. PS - I used ForgeCode with ForgeCode services enabled, which comes with (cloud-based) semantic search; regular ForgeCode without this will probably score differently.

Is that all you're testing?

Kimi K2.6-code-preview is currently only supported by Kimi CLI until it's officially rolled out next week with API support (that's the official word I got earlier this morning). That said, it wouldn't be hard to add support for it in OpenCode by copying the headers etc. from Kimi CLI into a Kimi-for-coding oauth plugin. I think I'll do this soon if I find time, so I can test it on OpenCode sooner. Kimi CLI uses OpenAI-compatible format plus Kimi-specific extensions/fields. Not sure if OpenCode supports these already; I'll need to take a look at the repo. Keep an eye out, I'll probably slip this result into the leaderboard in a day or so.
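For anyone curious what "OpenAI-compatible plus extensions" means in practice, here's a rough sketch of what such a request payload could look like. To be clear: the extension field names and header below (`extra_body`, `agent`, `X-Kimi-Agent`) are placeholders I made up for illustration; the real ones would have to be lifted from Kimi CLI's source, which is the whole point of the plugin idea.

```python
import json

def build_kimi_payload(messages, model="kimi-k2.6-code-preview"):
    """Build a chat-completion request body: standard OpenAI-compatible
    fields plus a placeholder slot for Kimi-specific extensions."""
    return {
        # Standard OpenAI-compatible fields
        "model": model,
        "messages": messages,
        "stream": True,
        # Hypothetical Kimi-specific extension fields -- NOT the real API;
        # the actual names would be copied from Kimi CLI
        "extra_body": {"agent": "opencode"},
    }

# Hypothetical headers a Kimi-for-coding oauth plugin might inject
headers = {
    "Authorization": "Bearer <oauth-token>",
    "X-Kimi-Agent": "kimi-cli",  # placeholder header name
}

payload = build_kimi_payload([{"role": "user", "content": "hello"}])
print(json.dumps(payload, indent=2))
```

Since the base format is OpenAI-compatible, OpenCode should already handle everything except the extension fields and headers, which is why this feels like a small plugin rather than a new provider integration.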

I was going to test Qwen 3.6 Plus, but they removed the free tier, and I don't think it's good enough for me to want to pay for it. But hey, if anyone knows anyone at Alibaba, point them this way, and maybe I can get it tested.

What is SanityHarness?

A harness I made for testing and evaluating coding agents. I used to run a lot of terminal-bench evals and share them around on Discord, but I wanted something similar that was more coding-agent-agnostic, because terminal-bench was a pain and near impossible to get working with most agents. Is this eval perfect? No. I tried to keep it simple and focused on my own needs, but I've improved it a lot over time, even before I made the leaderboard, and improved it further with community feedback.

The harness runs against a diverse set of tasks across six languages, picked to challenge models on problem solving rather than training data they might be overfit on. Agents are sandboxed with bubblewrap during eval, and solutions get validated inside purpose-built Docker containers. The full suite takes around 1-2 hours depending on provider and model. Score is weighted by a formula that factors in language rarity, esoteric feature usage, algorithmic novelty, and edge case density, with weights capped at 1.5x. The adjustment is fairly conservative, since these criteria can be a bit subjective. You'll find more information in the links below.
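The capped weighting described above could be sketched something like this. The exact formula isn't given in the post, so the per-criterion contribution here (each criterion on a 0..1 scale adding at most 12.5%) is my own guess at one conservative shape that respects the stated 1.5x cap, not the actual SanityHarness code.

```python
def weighted_score(base_score, language_rarity, esoteric_features,
                   algorithmic_novelty, edge_case_density, cap=1.5):
    """Scale a task's base score by the four criteria from the post,
    each rated 0..1, with the total multiplier capped at `cap` (1.5x).
    The 0.125 per-criterion weight is an illustrative assumption."""
    criteria = [language_rarity, esoteric_features,
                algorithmic_novelty, edge_case_density]
    multiplier = 1.0 + 0.125 * sum(criteria)
    return base_score * min(multiplier, cap)

# A task maxing out all four criteria hits the 1.5x cap
print(weighted_score(10, 1, 1, 1, 1))  # -> 15.0
# A plain task is left unweighted
print(weighted_score(10, 0, 0, 0, 0))  # -> 10.0
```

The cap is what keeps the subjectivity of these criteria from dominating: however generous the ratings, a task can never count for more than 1.5x a plain one.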

Previous related posts:

GitHub:

Closing Out

Big thanks to everyone who made this possible. Junie and Minimax have been very good with communication and helpful in providing me usage for these runs. Factory Droid and Moonshot too, to a lesser degree. I tried reaching out to GLM, but they haven't gotten back to me after saying they'd pass on my request. They also kinda ate $10 on their official paid API when I tried to run my eval on it, only getting halfway through; Opus only costs around $6-$7 to complete the full suite. C'mon Zai.

Oh yeah, I forgot to put this here. I have a discord server if anyone wants to join and discuss LLM stuff, etc. Feel free to make suggestions, or ask for help here too: https://discord.gg/rXNQXCTWDt

submitted by /u/lemon07r