I spent the past week testing a simple question:
Small local models often look weak inside coding agents. But how much of that is actually model weakness, and how much is scaffold mismatch?
So I held the model fixed and changed only the scaffold.
Same Qwen3.5-9B Q4 weights in both conditions.
Same Aider Polyglot benchmark.
Full 225 exercises.
Results:
- vanilla Aider: 19.11%
- little-coder: 45.56% mean pass@2 across two full runs
little-coder is not a new model. It is a scaffold I adapted to the behavioral profile of a ~10B local model: bounded reasoning budget, a Write guard that refuses to overwrite existing files, explicit workspace discovery, and small per-turn skill injections instead of one huge static preamble.
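To make the Write-guard idea concrete, here is a minimal sketch of what such a tool wrapper could look like. This is my own hypothetical illustration, not the actual little-coder code; the function name and messages are invented:

```python
from pathlib import Path

def guarded_write(path: str, content: str) -> str:
    """Hypothetical Write guard: refuse to clobber existing files,
    forcing the agent to read and edit them explicitly instead."""
    p = Path(path)
    if p.exists():
        # Small models often blindly rewrite whole files; bounce them back.
        return f"REFUSED: {path} already exists. Read it and propose an edit."
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(content)
    return f"WROTE: {path}"
```

The point is that the guard turns a common small-model failure mode (overwriting working code) into an explicit, recoverable tool error.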
This is not a conference paper. There are obvious things a proper paper would still want:
- more replications
- component ablations
- more model families
- maybe a second benchmark
But the effect size was large enough that I thought it was worth sharing now (unfortunately, I don't have time to do the above myself).
My takeaway is fairly narrow:
at this scale, coding-agent benchmark results are not just properties of model weights. They are also properties of scaffold–model fit.
I suspect sub-10B local models may have been written off too early in coding-agent evaluation.
Full write-up, code, and numbers here: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent
Would be very interested in replication attempts, failure cases, or reasons you think this would not generalize.
