I wanted to test a slightly different question than "can one open model beat GPT-5.4 Codex?"
The question was:
Can a combination of local models, scaffolding, repair loops, and routing policies running on home hardware get close enough to frontier coding models on my actual workload?
Short version: yes, surprisingly. On my first curated 10-task Go eval set, a routed local process got to 9/10 passing tests.
Links:
- little-coder: https://github.com/itayinbarr/little-coder
- The write-up that prompted this experiment: https://open.substack.com/pub/itayinbarr/p/honey-i-shrunk-the-coding-agent
Results on the 10-task set:
- GPT-5.4 best-of baseline: 10/10
- Routed local process: 9/10
- Qwen3.6 + little-coder: 8/10
- Qwen30 + little-coder: 5/10
- Original local Gandalf harness: 3/10
This was not a public benchmark. It was 10 real tasks extracted from my own Go repo, using copied workspaces so the live repo was not touched. The tasks include CLI changes, dependency enforcement, embedded version files, clock abstractions, error taxonomy, SQLite primitives, migrations, and baseline schema work.
## Hardware
The local setup:
- RTX 5090 32GB running Ollama on Frodo
- RTX Pro 6000 96GB available as Gandalf for the larger local repair/editor role
- Qwen3.6 35B A3B Q4_K_M on the 5090
- Qwen3-Coder 30B also available locally
- Qwen3-Coder-Next 80B on Gandalf through a vLLM/OpenAI-compatible endpoint
Qwen3.6 loaded on the 5090 at about 27GB VRAM, which left enough room for my embedding service to stay up.
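One convenient detail: both boxes look the same to the harness, because Ollama and vLLM each expose an OpenAI-compatible chat completions endpoint. The sketch below is illustrative only; the base URLs, ports, and model tags are assumptions, not my exact configuration.

```go
// Minimal sketch of talking to both backends through the same
// OpenAI-compatible API. URLs, ports, and model tags are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

func complete(baseURL, model, prompt string) (string, error) {
	body, _ := json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
	})
	resp, err := http.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	raw, _ := io.ReadAll(resp.Body)
	var out struct {
		Choices []struct {
			Message message `json:"message"`
		} `json:"choices"`
	}
	if err := json.Unmarshal(raw, &out); err != nil || len(out.Choices) == 0 {
		return "", fmt.Errorf("unexpected response: %s", raw)
	}
	return out.Choices[0].Message.Content, nil
}

func main() {
	// Qwen3.6 via Ollama on Frodo, Qwen3-Coder-Next via vLLM on Gandalf
	// (hypothetical model tags and ports).
	fmt.Println(complete("http://frodo:11434", "qwen3.6:35b", "write a Go hello world"))
	fmt.Println(complete("http://gandalf:8000", "qwen3-coder-next-80b", "write a Go hello world"))
}
```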
## The important part was the scaffold
The biggest improvement did not come from simply swapping models.
Earlier, I had a more basic local Aider-style harness around Gandalf. That got only 3/10 on the same kind of tasks. It was not useless, but it clearly was not competitive with frontier coding agents.
Then I tried little-coder with Qwen3.6 35B after seeing the argument that local coding models are often being tested inside scaffolds that are poorly matched to them.
That changed the result a lot.
Qwen3.6 + little-coder alone passed 8/10. The failures were:
- one deterministic fake-clock / timer / ticker task
- one SQLite task on one run, which later passed on rerun
The routed local process got to 9/10 by combining:
- Qwen3.6 + little-coder as the default local implementer
- Qwen30 + little-coder for fake-clock/timer/ticker-shaped tasks
- deterministic harness fixups like `goimports`, `gofmt`, `go mod tidy`, and `go test -timeout`
- Gandalf direct file repair for narrow compile/import/schema failures (the escalation order is sketched below)
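To make the ordering concrete, here is a stripped-down sketch of that escalation loop. It is not the harness code itself; the helper functions are placeholders for the little-coder call, the Go fixup tools, the test run, and the Gandalf repair path.

```go
// Simplified sketch of the per-task escalation order, not the actual harness code.
package harness

type Task struct {
	ID        string
	Shape     string // e.g. "go-module", "sql-store", "fake-clock"
	Workspace string // copied workspace, never the live repo
	Prompt    string
}

type Result struct {
	Task    Task
	Status  string
	Failure string
}

// Placeholder hooks; the real harness shells out to little-coder, the Go
// fixup tools, go test, and the Gandalf JSON repair path.
func implement(task Task) error                           { return nil }
func runFixups(workspace string)                          {}
func runTests(workspace string) (failure string, ok bool) { return "", true }
func isNarrowFailure(failure string) bool                 { return false }
func gandalfRepair(workspace, failure string)             {}

func runTask(task Task) Result {
	if err := implement(task); err != nil {
		return Result{Task: task, Status: "implement-error", Failure: err.Error()}
	}
	runFixups(task.Workspace) // goimports, gofmt, go mod tidy
	failure, ok := runTests(task.Workspace)
	if ok {
		return Result{Task: task, Status: "pass"}
	}
	// Only narrow, mechanical failures (compile / import / schema) go to
	// Gandalf's direct file-repair path; everything else is a real failure.
	if isNarrowFailure(failure) {
		gandalfRepair(task.Workspace, failure)
		runFixups(task.Workspace)
		if _, ok := runTests(task.Workspace); ok {
			return Result{Task: task, Status: "pass-after-repair"}
		}
	}
	return Result{Task: task, Status: "fail", Failure: failure}
}
```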
The current routed result:
little-coder-routed-local: 4.60/5 avg | 9/10 tests pass | $0.00 | 1489s
Per-task:
001 pass
002 pass
003 pass
004 pass
005 pass
006 fail
007 pass
008 pass
009 pass
010 pass
The one remaining failure was the deterministic fake-clock task. It requires getting timers, tickers, scheduled deadlines, goroutine wakeups, and leak behavior exactly right. The local models kept producing plausible implementations that either deadlocked or delivered ticks at the wrong time.
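For context, the task asks for roughly this kind of surface. The method names below are illustrative, not the repo's actual API; the difficulty is entirely in the semantics behind it.

```go
// Roughly the shape of the interface the failing task asks for. The hard part
// is the implementation: delivering ticks deterministically when test code
// advances the clock, waking blocked goroutines in deadline order, and not
// leaking timers. Names here are illustrative.
package clock

import "time"

type Clock interface {
	Now() time.Time
	NewTimer(d time.Duration) Timer
	NewTicker(d time.Duration) Ticker
	Sleep(d time.Duration)
}

type Timer interface {
	C() <-chan time.Time
	Stop() bool
	Reset(d time.Duration) bool
}

type Ticker interface {
	C() <-chan time.Time
	Stop()
}

// The fake variant adds manual control for tests. Advancing time must fire
// every timer and ticker whose deadline has passed, in order, without racing
// the goroutines waiting on their channels. That wakeup and ordering logic is
// where the local models kept deadlocking or ticking at the wrong time.
type FakeClock interface {
	Clock
	Advance(d time.Duration)
	BlockUntilWaiters(n int) // wait until n goroutines are parked on timers
}
```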
## What surprised me
Qwen3.6 was dramatically better than Qwen30 on the module-sized Go tasks. In particular, it passed the store/migration/schema tasks that Qwen30 struggled with.
But Qwen3.6 was not strictly better everywhere. Qwen30 had previously solved the fake-clock task in one run, while Qwen3.6 failed it. In the full routed run, even Qwen30 failed that task due to variance.
That convinced me the right abstraction is not "pick the best model." The right abstraction is "route by task shape and failure mode."
The local system should make decisions like:
General Go module work -> Qwen3.6 + little-coder
SQL/store/migration work -> Qwen3.6 + little-coder
Narrow compile/import failure -> local Gandalf repair
Timer/ticker/concurrency bug -> specialized playbook or frontier escalation
I do not want to be the traffic controller manually. The harness should collect task shape, model choice, result, repair count, and elapsed time, then feed that into an automatic router.
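As a sketch of where that should land, the routing policy and the per-run record might look something like this. The shapes and route names mirror the table above; they are illustrative, not the harness's actual identifiers.

```go
// Sketch of a task-shape router plus the outcome record that should
// eventually drive it automatically. Names are placeholders.
package harness

type Route struct {
	Implementer string // which model+scaffold pair runs the task
	Escalate    bool   // hand off to a playbook or frontier model on failure
}

func route(shape string) Route {
	switch shape {
	case "go-module", "sql-store", "migration":
		return Route{Implementer: "qwen3.6+little-coder"}
	case "compile-fix", "import-fix":
		return Route{Implementer: "gandalf-repair"}
	case "timer-ticker", "fake-clock", "concurrency":
		return Route{Implementer: "qwen30+little-coder", Escalate: true}
	default:
		return Route{Implementer: "qwen3.6+little-coder"}
	}
}

// RunRecord is what every eval run should log so the routing policy can learn
// from outcomes instead of my intuition.
type RunRecord struct {
	TaskShape   string
	Model       string
	Passed      bool
	RepairCount int
	ElapsedSec  float64
}
```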
## What I changed in the harness
A few practical details mattered a lot:
- Run evals in copied workspaces only. Never let the agent touch the live repo.
- Force `go test` timeouts. Fake-clock bugs can otherwise hang forever.
- Run deterministic cleanup outside the model: `goimports`, `gofmt`, `go mod tidy`.
- Make repair edits machine-parseable. I used a direct JSON file-repair path for Gandalf instead of free-form chat repair (an illustrative shape is sketched after this list).
- Keep tests and testdata read-only, but allow non-Go implementation artifacts like `.sql` and `VERSION`.
- Record every run to disk with status JSON, test logs, diffs, and a report.
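For the machine-parseable repair path, the rough shape is something like the sketch below. This is an illustrative format, not the exact one my harness uses: the repair model returns full replacement file contents as JSON, and the harness validates paths before writing anything.

```go
// Illustrative shape of a machine-parseable file-repair response and how it
// gets applied. Field names and rules are assumptions, not the exact format.
package repair

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

type RepairResponse struct {
	Reason string       `json:"reason"`
	Files  []FileRepair `json:"files"`
}

type FileRepair struct {
	Path    string `json:"path"`    // relative to the copied workspace
	Content string `json:"content"` // full new file contents
}

// Apply parses the model output and writes each repaired file, refusing paths
// that escape the workspace or touch read-only tests and testdata.
func Apply(workspace string, raw []byte) error {
	var resp RepairResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return fmt.Errorf("repair output is not valid JSON: %w", err)
	}
	for _, f := range resp.Files {
		clean := filepath.Clean(f.Path)
		if filepath.IsAbs(clean) || strings.HasPrefix(clean, "..") {
			return fmt.Errorf("refusing path outside workspace: %s", f.Path)
		}
		if strings.HasSuffix(clean, "_test.go") || strings.Contains(clean, "testdata") {
			return fmt.Errorf("refusing to modify read-only path: %s", f.Path)
		}
		dst := filepath.Join(workspace, clean)
		if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
			return err
		}
		if err := os.WriteFile(dst, []byte(f.Content), 0o644); err != nil {
			return err
		}
	}
	return nil
}
```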
The `go test -timeout` wrapper was especially important. Before that, one bad fake-clock implementation could consume an entire eval cycle.
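A minimal sketch of those two deterministic steps, with an illustrative timeout value:

```go
// Sketch of the non-model cleanup and test steps. Tool names come straight
// from the list above; the 120s timeout is illustrative, not a recommendation.
package harness

import (
	"fmt"
	"os/exec"
)

// fixup runs deterministic cleanup in the copied workspace so the model never
// burns a repair round on formatting or import churn.
func fixup(workspace string) error {
	cmds := [][]string{
		{"goimports", "-w", "."},
		{"gofmt", "-w", "."},
		{"go", "mod", "tidy"},
	}
	for _, args := range cmds {
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Dir = workspace
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("%v failed: %v\n%s", args, err, out)
		}
	}
	return nil
}

// runTests enforces a hard timeout so a deadlocked fake-clock implementation
// fails fast instead of eating an entire eval cycle.
func runTests(workspace string) (output string, ok bool) {
	cmd := exec.Command("go", "test", "-timeout", "120s", "./...")
	cmd.Dir = workspace
	out, err := cmd.CombinedOutput()
	return string(out), err == nil
}
```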
## Caveats
This is not a claim that Qwen3.6 beats GPT-5.4 Codex.
GPT-5.4 still got 10/10 on this slice. The local routed process got 9/10.
Also, this is only 10 tasks from one Go repo. It is useful to me because it is my real workload, but it is not a broad coding benchmark.
The result I care about is narrower:
For my Go workload, a local scaffolded and routed process is now close enough that it can probably become the default path for routine work, with frontier models reserved for harder tasks and known failure classes.
That is a big deal for cost and rate limits.
## My current conclusion
The model matters, but the scaffold matters more than I expected.
Qwen3.6 35B is strong enough to be useful locally, but it became genuinely interesting only when paired with:
- little-coder
- task-specific routing
- deterministic Go fixups
- local repair
- eval feedback on real tasks
The next step is to make the router smarter:
- run Qwen3.6 by default
- repair narrow local failures locally
- escalate fake-clock/concurrency/time semantics to frontier or a specialized playbook
- keep logging outcomes so the routing policy improves over time
That feels like the real path forward: not one local model trying to imitate Codex, but a local coding system that knows when and how to use each model.
(Written by me, rewritten better by Codex 5.4.)