Qwen3.6 35B + the right coding scaffold got my local setup to 9/10 on real Go tasks

Reddit r/LocalLLaMA / 4/23/2026


Key Points

  • A personal Go coding evaluation found that a locally routed setup can reach near-frontier performance: 9/10 tests passing on a 10-task real-world Go benchmark compared with a GPT-5.4 best-of baseline of 10/10.
  • The major gains came less from switching models and more from using an appropriate “coding scaffold,” where Qwen3.6 35B with little-coder achieved 8/10 versus a previous Gandalf-based harness scoring only 3/10.
  • Failures with Qwen3.6 + little-coder were concentrated in deterministic fake-clock/timer/ticker logic and a SQLite-related case that passed on rerun, suggesting some task variability and model-scaffold mismatch.
  • A routed approach improved results by selecting different local models/policies per task type, using Qwen30 + little-coder for clock/timer-shaped tasks, and applying automated repair steps like goimports/gofmt, go mod tidy, and timeouts.
  • The setup ran on home hardware with Ollama and vLLM/OpenAI-compatible endpoints (RTX 5090 for Qwen3.6 and a second GPU role for “Gandalf”), indicating practical feasibility of sophisticated local coding agents.

I wanted to test a slightly different question than "can one open model beat GPT-5.4 Codex?"

The question was:

Can a combination of local models, scaffolding, repair loops, and routing policies running on home hardware get close enough to frontier coding models on my actual workload?

Short version: yes, surprisingly. On my first curated 10-task Go eval set, a routed local process got to 9/10 passing tests.

Links:

- little-coder: https://github.com/itayinbarr/little-coder

- The write-up that prompted this experiment: https://open.substack.com/pub/itayinbarr/p/honey-i-shrunk-the-coding-agent

Results:

  • GPT-5.4 best-of baseline 10/10
  • Routed local process 9/10
  • Qwen3.6 + little-coder 8/10
  • Qwen30 + little-coder 5/10
  • Original local Gandalf harness 3/10

This was not a public benchmark. It was 10 real tasks extracted from my own Go repo, using copied workspaces so the live repo was not touched. The tasks include CLI changes, dependency enforcement, embedded version files, clock abstractions, error taxonomy, SQLite primitives, migrations, and baseline schema work.

## Hardware

The local setup:

  • RTX 5090 32GB running Ollama on Frodo
  • RTX Pro 6000 96GB available as Gandalf for the larger local repair/editor role
  • Qwen3.6 35B A3B Q4_K_M on the 5090
  • Qwen3-Coder 30B also available locally
  • Qwen3-Coder-Next 80B on Gandalf through a vLLM/OpenAI-compatible endpoint

Qwen3.6 loaded on the 5090 at about 27GB VRAM, which left enough room for my embedding service to stay up.

## The important part was the scaffold

The biggest improvement did not come from simply swapping models.

Earlier, I had a more basic local Aider-style harness around Gandalf. That got only 3/10 on the same kind of tasks. It was not useless, but it clearly was not competitive with frontier coding agents.

Then I tried little-coder with Qwen3.6 35B after seeing the argument that local coding models are often being tested inside scaffolds that are poorly matched to them.

That changed the result a lot.

Qwen3.6 + little-coder alone passed 8/10. The failures were:

  • one deterministic fake-clock / timer / ticker task
  • one SQLite task on one run, which later passed on rerun

The routed local process got to 9/10 by combining:

  • Qwen3.6 + little-coder as the default local implementer
  • Qwen30 + little-coder for fake-clock/timer/ticker-shaped tasks
  • deterministic harness fixups like `goimports`, `gofmt`, `go mod tidy`, and `go test -timeout`
  • Gandalf direct file repair for narrow compile/import/schema failures

The current routed result:

little-coder-routed-local: 4.60/5 avg | 9/10 tests pass | $0.00 | 1489s

Per-task: 001-005 pass, 006 fail, 007-010 pass.

The one remaining failure was the deterministic fake-clock task. It requires getting timers, tickers, scheduled deadlines, goroutine wakeups, and leak behavior exactly right. The local models kept producing plausible implementations that either deadlocked or delivered ticks at the wrong time.

## What surprised me

Qwen3.6 was dramatically better than Qwen30 on the module-sized Go tasks. In particular, it passed the store/migration/schema tasks that Qwen30 struggled with.

But Qwen3.6 was not strictly better everywhere. Qwen30 had previously solved the fake-clock task in one run, while Qwen3.6 failed it. In the full routed run, even Qwen30 failed that task due to variance.

That convinced me the right abstraction is not "pick the best model." The right abstraction is "route by task shape and failure mode."

The local system should make decisions like:

  • General Go module work -> Qwen3.6 + little-coder
  • SQL/store/migration work -> Qwen3.6 + little-coder
  • Narrow compile/import failure -> local Gandalf repair
  • Timer/ticker/concurrency bug -> specialized playbook or frontier escalation

I do not want to be the traffic controller manually. The harness should collect task shape, model choice, result, repair count, and elapsed time, then feed that into an automatic router.

## What I changed in the harness

A few practical details mattered a lot:

  1. Run evals in copied workspaces only. Never let the agent touch the live repo.
  2. Force `go test` timeouts. Fake-clock bugs can otherwise hang forever.
  3. Run deterministic cleanup outside the model: `goimports`, `gofmt`, `go mod tidy`.
  4. Make repair edits machine-parseable. I used a direct JSON file-repair path for Gandalf instead of free-form chat repair.
  5. Keep tests and testdata read-only, but allow non-Go implementation artifacts like `.sql` and `VERSION`.
  6. Record every run to disk with status JSON, test logs, diffs, and a report.

The `go test -timeout` wrapper was especially important. Before that, one bad fake-clock implementation could consume an entire eval cycle.

## Caveats

This is not a claim that Qwen3.6 beats GPT-5.4 Codex.

GPT-5.4 still got 10/10 on this slice. The local routed process got 9/10.

Also, this is only 10 tasks from one Go repo. It is useful to me because it is my real workload, but it is not a broad coding benchmark.

The result I care about is narrower:

For my Go workload, a local scaffolded and routed process is now close enough that it can probably become the default path for routine work, with frontier models reserved for harder tasks and known failure classes.

That is a big deal for cost and rate limits.

## My current conclusion

The model matters, but the scaffold matters more than I expected.

Qwen3.6 35B is strong enough to be useful locally, but it became genuinely interesting only when paired with:

  • little-coder
  • task-specific routing
  • deterministic Go fixups
  • local repair
  • eval feedback on real tasks

The next step is to make the router smarter:

  • run Qwen3.6 by default
  • repair narrow local failures locally
  • escalate fake-clock/concurrency/time semantics to frontier or a specialized playbook
  • keep logging outcomes so the routing policy improves over time

That feels like the real path forward: not one local model trying to imitate Codex, but a local coding system that knows when and how to use each model.

(Written by me, rewritten by Codex 5.4.)

submitted by /u/benfinklea