Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models

Reddit r/LocalLLaMA / 4/19/2026


Key Points

  • The author tests whether weak performance of small local coding models comes from model capability or “scaffold mismatch” by keeping the Qwen3.5-9B Q4 weights fixed and changing only the agent scaffold.
  • Using the same Aider Polyglot benchmark with 225 exercises, the vanilla Aider setup scores 19.11% pass@2, while an adapted scaffold (“little-coder”) scores 45.56% pass@2.
  • The “little-coder” scaffold is tailored to the behavioral profile of ~10B local models, including a bounded reasoning budget, a write guard to prevent overwriting files, explicit workspace discovery, and smaller per-turn skill injections rather than a single large preamble.
  • The write-up argues that at this model scale, coding-agent benchmark outcomes reflect both model weights and the fit between scaffold and model behavior, suggesting sub-10B models may be underestimated in evaluation.
  • The author calls for replication, ablation studies, and broader benchmarking to verify generalization and understand failure modes.

I spent the past week testing a simple question:

Small local models often look weak inside coding agents. But how much of that is actually model weakness, and how much is scaffold mismatch?

So I held the model fixed and changed only the scaffold.

Same Qwen3.5-9B Q4 weights in both conditions.

Same Aider Polyglot benchmark.

Full 225 exercises.

Results:

- vanilla Aider: 19.11% pass@2

- little-coder: 45.56% pass@2 (mean across two full runs)

little-coder is not a new model. It is a scaffold I adapted to the behavioral profile of a ~10B local model: a bounded reasoning budget, a Write guard that refuses to overwrite existing files, explicit workspace discovery, and small per-turn skill injections instead of one huge static preamble.
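For concreteness, here is a minimal sketch of two of those mechanisms: the overwrite guard and per-turn skill injection. The function names and skill snippets are my own illustration of the idea, not little-coder's actual code.

```python
from pathlib import Path

def guarded_write(path: str, content: str, allow_overwrite: bool = False) -> bool:
    """Refuse to overwrite an existing file unless explicitly allowed.

    Returns True if the file was written, False if the guard blocked it.
    Small models often clobber files wholesale; the guard forces an
    explicit read/edit step instead.
    """
    p = Path(path)
    if p.exists() and not allow_overwrite:
        return False  # guard trips: agent must inspect the file first
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(content)
    return True

# Hypothetical skill snippets keyed by per-turn signals (illustrative only).
SKILLS = {
    "test_failure": "Re-run the failing test and read the traceback before editing.",
    "new_file": "List the workspace first; reuse existing helpers where possible.",
    "diff_edit": "Emit a minimal diff touching only the failing function.",
}

def inject_skills(turn_signals: list[str], budget: int = 2) -> str:
    """Inject at most `budget` small, relevant skill snippets this turn,
    instead of prepending one huge static preamble to every prompt."""
    picked = [SKILLS[s] for s in turn_signals if s in SKILLS][:budget]
    return "\n".join(picked)
```

The point of both pieces is the same: they shrink the decision surface a ~10B model faces on any single turn, rather than making the model smarter.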

This is not a conference paper. There are obvious things a proper paper would still want:

- more replications

- component ablations

- more model families

- maybe a second benchmark

But the effect size was large enough that I thought it was worth sharing now; unfortunately, I don't have time to do the above myself.

My takeaway is fairly narrow:

At this scale, coding-agent benchmark results are not just properties of model weights. They are also properties of scaffold–model fit.

I suspect sub-10B local models may have been written off too early in coding-agent evaluation.

Full write-up, code, and numbers here: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent

Would be very interested in replication attempts, failure cases, or reasons you think this would not generalize.

submitted by /u/Creative-Regular6799