Qwen3.6-27B vs Coder-Next

Reddit r/LocalLLaMA / 5/3/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The post reports a 20-hour side-by-side evaluation of Qwen3.6-27B (with “thinking” enabled/disabled) versus Coder-Next on local RTX PRO 6000 systems, concluding the winner depends on the task.
  • Across multiple test cells at N=10, Coder-Next and Qwen3.6-27B show very similar overall performance with results described as statistically tied (overlapping Wilson confidence intervals).
  • Disabling “thinking” on Qwen3.6-27B is reported to improve consistency, while differences between “thinking” and “no-think” mainly affect verbosity of reasoning prose rather than core output decisions.
  • The author found that a 3.6-35B-A3B variant performed poorly on many tasks and was treated as failure evidence rather than continued comparison.
  • The largest single contrast is a live market-research task where Qwen3.6-27B reportedly outperforms Coder-Next, while Coder-Next reportedly excels on bounded business memo and doc-synthesis tasks at far lower cost per shipped run.
  • The overall motivation is that traditional benchmarks may be gamed, so the author deliberately stress-tested models by giving them tasks they could win and tasks they were likely to fail.

Burned about 20 hours of side-by-side compute on my two RTX PRO 6000 Blackwells trying to get a definitive answer on which of these two models is clearly better. As with many things in life, after many tokens and many kWh, the answer was "it depends."

In the aggregate these models are actually crazy well matched against each other: scoring similarly overall across a wide range of tests and scenarios, hitting and missing on different things, failing and succeeding in different ways. Across the 4 cells I ran at N=10, Coder-Next shipped 25/40 and 27B-thinking shipped 30/40: statistically tied, with overlapping Wilson confidence intervals.
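
For anyone who wants to sanity-check the "statistically tied" claim, here's a minimal sketch of the Wilson score interval using the counts reported above (plain Python, no dependencies):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)

# The two headline cells: Coder-Next 25/40 vs 27B-thinking 30/40
coder = wilson_ci(25, 40)   # ~ (0.47, 0.76)
qwen = wilson_ci(30, 40)    # ~ (0.60, 0.86)
print(coder, qwen)
print("overlap:", coder[1] >= qwen[0])  # True, hence "statistically tied"
```

The same function reproduces the other intervals quoted in the post, e.g. `wilson_ci(0, 10)` gives an upper bound of ~27.8% for the market-research collapse mentioned further down.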

On the face of it, that kind of makes sense. 27B is a later-gen dense model that leans hard on thinking. Coder-Next has roughly 3x the total parameters to work with but only activates about 3B at a time as it works. Depending on what you're trying to do, either could be the correct choice.

Kind of interestingly, 27B with thinking disabled was the most consistent shipper of work: 95.8% across the full 12-cell grid at N=10 (Wilson 95% [90.5%, 98.2%]). Same model weights as 27B-thinking, just run with `--no-think`. A side-by-side hand-graded read of the cells where both shipped found that substantive output is preserved; the difference is the verbosity of the reasoning prose, not the output decisions. The "thinking-trace as loop substrate" mechanism turned out to be real: the documented word-trim loop on doc-synthesis halves under no-think (4/10 → 2/10).
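
For context on what `--no-think` maps to under the hood: with today's Qwen3 models the switch lives in the chat template, and the sketch below assumes Qwen3.6 keeps that same convention (the repo id is hypothetical, since the model isn't confirmed to exist under that name on the Hub):

```python
from transformers import AutoTokenizer

# Hypothetical repo id; assumes Qwen3.6 keeps the Qwen3 chat-template switch.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")

messages = [{"role": "user", "content": "Draft a one-page business memo on Q3 spend."}]

# Same weights either way; only the rendered prompt changes.
prompt_think = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
prompt_nothink = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```

Qwen3 also honors a soft switch (appending `/no_think` to the user turn), so a harness flag like the post's `--no-think` could plausibly be implemented either way.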

3.6-35B-A3B fell flat on its face so often that it didn't seem worth continuing to compare it against the other two. The folder is kept as failure-mode evidence.

I tossed a lot of crazy stuff at these models over the course of a few days and kept my two GPUs very warm and very busy in the process. I jumped into this mainly because, for lack of a better term, I felt like the traditional benchmarks were being gamed. So I wanted to just chuck these guys in the dirt, abuse them, and see what happened.

The idea was to give them tasks they could win and tasks where they were essentially destined to fail, then study how they won and failed and what that looked like. The most lopsided single result: Coder-Next went 0/10 on a live market-research task where 27B went 8/10 (Wilson 95% [0%, 27.8%] for the Coder-Next collapse, and it's reproducible). The inverse: Coder-Next ships 10/10 on bounded business-memo and doc-synthesis tasks at 60–100x lower cost per shipped run than either 27B variant. Same models, very different shapes of "good at."
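
On "cost per shipped run": the post doesn't define it, but the natural reading is total spend across all attempts in a cell, amortized over the runs that actually shipped. A toy sketch of how the metric behaves; every number below is made up for illustration, not taken from the post's data:

```python
def cost_per_shipped_run(cost_per_attempt: float, attempts: int, ships: int) -> float:
    """Total spend on a cell, amortized over the runs that shipped.

    Failed attempts still burn tokens and kWh, so a cheap model that
    ships 10/10 can beat a pricier one that ships 8/10 by a wide multiple.
    """
    if ships == 0:
        return float("inf")  # nothing shipped: the metric diverges
    return cost_per_attempt * attempts / ships

# Purely illustrative inputs: a 3B-active MoE attempt vs. a dense 27B
# attempt that spends extra tokens on a long reasoning trace.
cheap = cost_per_shipped_run(cost_per_attempt=1.0, attempts=10, ships=10)
pricey = cost_per_shipped_run(cost_per_attempt=30.0, attempts=10, ships=8)
print(f"{pricey / cheap:.1f}x")  # 37.5x with these toy numbers
```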

There's a ton of data; I tried to make it easy to sort through. Right now this is all pretty much just about thoroughly comparing these two models.

Either way, I'm sleepy now. Let me know your thoughts or if you have any questions, and the repo is below. I'll talk more about this when I'm not looking to pass out lol.

https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

submitted by /u/Signal_Ad657