Slower Means Faster: Why I Switched from Qwen3 Coder Next to Qwen3.5 122B

Reddit r/LocalLLaMA / 3/27/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A user reports that Qwen3 Coder Next looked fast on paper (~1000 t/s prompt processing, ~37 t/s generation) but repeatedly caused backend crashes during a week of local testing.
  • With an agentic, low-babysitting workflow, the model completed only about 15 of 110 tasks on a good day due to failures and instability, even after trying different backends and configurations.
  • The user switched to Qwen3.5 122B, which had worse raw throughput on their hardware (~700 t/s prefill, ~17 t/s generation), expecting a slowdown.
  • Despite lower token speed, the 122B model completed roughly twice the total work in the same time, with fewer failures, less retrying, more stable backend behavior, and higher code quality that required less human correction.
  • The post argues that real-world agentic coding throughput depends on stability and output reliability, not just raw token/s metrics, and recommends trying 122B+ models for complex coding agents when hardware allows.

https://preview.redd.it/jn22okg8elrg1.png?width=1024&format=png&auto=webp&s=49232d4474d8c7aa5d3f8f2e85f7dc8ba16abe78

I spent about a week running Qwen3 Coder Next on my local rig. The numbers looked great on paper: ~1000 t/s prompt processing, ~37 t/s generation. I was using a Ralph-style agentic approach, keeping my manual involvement minimal while the model worked through tasks autonomously.

The problem? My backend was crashing constantly. Even when it ran stably for a couple of hours straight, actual progress was painfully slow. My experimental project was split into 110 tasks. On a good day, Qwen3 Coder Next knocked out maybe 15 of them. I tried different backends and different configs - same story.
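For the curious, the harness is basically a dumb loop with a small retry budget per task. A minimal Python sketch of that shape - the `run_task` stub, the retry count, and the simulated success rate are all illustrative placeholders, not my actual setup:

```python
import random
import time

MAX_RETRIES = 3  # hypothetical retry budget per task, not my real config

def run_task(task: str) -> bool:
    """Stand-in for one model call through your local backend
    (llama.cpp, vLLM, etc.). Here it just simulates an ~85%
    per-attempt success rate."""
    return random.random() < 0.85

def agent_loop(tasks: list[str]) -> tuple[list[str], list[str]]:
    """Feed tasks to the model one by one with minimal babysitting,
    retrying each a few times before giving up on it."""
    done, failed = [], []
    for task in tasks:
        for _ in range(MAX_RETRIES):
            try:
                if run_task(task):
                    done.append(task)
                    break
            except ConnectionError:  # a crashed backend surfaces here
                time.sleep(5)        # give it a moment to come back up
        else:
            failed.append(task)      # retries exhausted
    return done, failed

if __name__ == "__main__":
    tasks = [f"task-{i:03d}" for i in range(110)]  # mirrors my 110-task split
    done, failed = agent_loop(tasks)
    print(f"completed {len(done)}/{len(tasks)}, failed {len(failed)}")
```

The point is just that every crash or bad output burns a whole attempt, so per-attempt reliability dominates how fast the loop actually moves.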

Eventually I got fed up and decided to just try something heavier: Qwen3.5 122B.

The specs are noticeably worse - around 700 t/s prefill and ~17 t/s generation on my RTX 5070 Ti + potato 96 GB DDR4. That's roughly 30% slower prefill and less than half the generation speed. I expected to feel that slowdown.

What actually happened surprised me. The 122B model was completing roughly twice the work in the same amount of time. More tasks done, fewer failures, less babysitting. The backend stayed stable, outputs required fewer retries, and the code quality meant less back-and-forth to fix things.

It's one of those counterintuitive hardware/AI lessons: raw token speed doesn't equal real-world throughput. A faster model that hallucinates more, crashes more, or produces shakier code ends up costing you far more time than its extra speed saves.
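Back-of-the-envelope, you can see it in the arithmetic. Only the t/s figures below come from my runs; the tokens-per-attempt and per-attempt success rates are made-up illustrative values:

```python
def tasks_per_hour(gen_tps: float, tokens_per_attempt: int, p_success: float) -> float:
    """Effective task throughput: attempts per hour times per-attempt
    success probability. Ignores prefill and human cleanup time."""
    attempts_per_hour = 3600 * gen_tps / tokens_per_attempt
    return attempts_per_hour * p_success

# Illustrative only: the t/s numbers are from my runs, the success
# rates and token counts are assumptions for the sake of the example.
fast   = tasks_per_hour(gen_tps=37, tokens_per_attempt=4000, p_success=0.25)
steady = tasks_per_hour(gen_tps=17, tokens_per_attempt=4000, p_success=0.80)

print(f"fast-but-flaky : {fast:.1f} tasks/hour")    # ~8.3
print(f"slow-but-stable: {steady:.1f} tasks/hour")  # ~12.2
```

Even with generous assumptions for the fast model, the stable one comes out ahead once retries and failures are priced in.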

If your hardware can handle it, I genuinely recommend trying 122B+ scale models for complex agentic coding tasks. The difference on my project was night and day.

submitted by /u/Fast_Thing_7949