Notes on what actually breaks when you run a coding agent on small local models

Reddit r/LocalLLaMA / 4/30/2026


Key Points

  • Running coding agents on small local and free-tier cloud models shows repeat failure modes that are not fully captured by standard benchmarks.
  • The most common issue is that small models ignore “raw code only / no markdown” instructions and still wrap answers in triple backticks, so code tools should strip markdown fences by default.
  • Structured (machine-parseable) outputs like JSON become unreliable on models below ~7B parameters, especially for complex multi-step tasks with edge cases.
  • Another frequent breakdown is incorrect edits: small models may rename or modify the wrong file or the wrong function even when the prompt includes function names and a project map, so orchestration must validate file paths and symbol existence.
  • Practical mitigation is to validate outputs, retry with stricter instructions once, and then use a permissive JSON-from-text parser while enforcing hard checks on referenced files/functions.

I've spent the last few weeks running real multi-file coding tasks through small local models and small cloud models on free tiers. I wanted to share the failure points that came up consistently, since some of them surprised me and they might help someone else in the community.

Markdown fences are the most common failure across every small model I tested.

You can put "output only raw code, no markdown formatting" in the system prompt. The model agrees. The model also wraps its response in triple backticks anyway, especially when the request involves anything that looks like explaining code. Qwen3.5:9b and gemma4:e4b are the most consistent at following the instruction but still slip occasionally. Others from my testing fail this rule frequently enough that you basically have to assume the fences will be there.

The fix isn't better prompting. It's stripping fences in post-processing as a default. Any code-editing tool using small models has to do this.
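A minimal sketch of that post-processing step (the helper name `strip_fences` is mine, not from the post):

```python
import re

def strip_fences(text: str) -> str:
    """Remove a wrapping markdown code fence if one is present.

    Handles both ```lang openers and bare ``` fences; returns the
    text unchanged when there is no fence to strip.
    """
    match = re.match(r"^```[\w+-]*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    if match:
        return match.group(1)
    return text
```

Running this unconditionally on every model response is cheap, and it is a no-op on the (rare) responses that actually followed the instruction.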

From my testing, structured output is unreliable below 7B parameters.

If your agent needs the model to return JSON for task lists (like in my case), action types, or anything machine-parseable, small models fail at this far more often than benchmarks suggest. The benchmarks measure whether the model can produce valid JSON. They don't measure whether it produces valid JSON when given a complex multi-step instruction with edge cases.

In my testing, Gemma4:e4b is the most reliable for structured output among the local models I tried, with Qwen3.5:9B close behind. Codellama (although old) struggles. On the cloud side, Llama 3.3 70B on Groq is rock solid for structured output; it was the most consistent. Other models on OpenRouter had quirks. For example, Nemotron 3 Super was very good, but it stopped responding on OpenRouter after hitting 100k tokens of usage.

The practical workaround is to validate the JSON, retry once with an even more explicit instruction, then fall back to a permissive parser that can extract JSON from prose-wrapped responses.
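The fallback step can be sketched like this (a rough version of a permissive parser; a production one would handle nested braces and arrays more carefully):

```python
import json
import re

def parse_json_loosely(text: str):
    """Strict JSON parse first; if that fails, try to extract a
    JSON object that the model wrapped in prose. Returns None if
    nothing parseable is found, signaling the caller to retry."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Permissive fallback: grab the widest {...} span from the response
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```

The retry-then-fallback order matters: a stricter reprompt is cheap and often fixes it, and the permissive parser catches the "Sure! Here's your JSON: ..." responses that small models love to produce.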

Models will edit the wrong file if you let them.
Give a small model a task that mentions a function name, a project map listing similar function names, and a request like "rename validateToken to verifyToken" (a real example from my testing). It might rename validateToken correctly. It might also rename validateUser, or modify a comment that mentions the function, or apply the rename to the wrong file entirely. The model treats the project map as suggestions, not constraints.

The fix is at the orchestration layer, not the prompt. Validate that file paths the model mentions actually exist. Validate that function names it claims to be operating on are actually in the files it claims they're in. Throw clear errors when there's a mismatch. Small models lie confidently and the agent has to not trust them.
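Those hard checks can be as simple as this (a sketch; `validate_edit_target` and its signature are my naming, and the symbol check here is a plain substring test, not a real parse):

```python
from pathlib import Path

def validate_edit_target(file_path: str, symbol: str, project_root: str = ".") -> None:
    """Fail loudly before applying a model-proposed edit.

    Checks that the file the model claims to edit exists, and that
    the symbol it claims to operate on actually appears in that file.
    """
    path = Path(project_root) / file_path
    if not path.is_file():
        raise FileNotFoundError(f"Model referenced a nonexistent file: {file_path}")
    if symbol not in path.read_text(encoding="utf-8"):
        raise ValueError(f"Symbol {symbol!r} not found in {file_path}")
```

Running this between "model proposes edit" and "tool applies edit" turns silent wrong-file renames into clear errors the agent can recover from.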

Question vs action classification is harder than it sounds.
Asking "how many lines does utils.js have" should be a read-only operation. But if your executor only has one mode — edit this file — it will dutifully edit the file to contain the answer to your question, because the model interprets the request through the only action it knows.

The fix is having the planner classify requests into action types before any execution. Read-only queries route to a separate code path that never touches disk. Without this, a casual question can delete your file.
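The routing itself is trivial once the planner emits an action type; the point is that the read-only path structurally cannot write. A sketch (function names are illustrative):

```python
def route_request(action_type: str, request: str, editor, reader):
    """Dispatch on the planner's classification so read-only queries
    never reach the file-editing code path."""
    if action_type == "query":
        return reader(request)   # read-only: never touches disk
    if action_type == "edit":
        return editor(request)   # the only path allowed to write
    raise ValueError(f"Unknown action type: {action_type!r}")
```

The safety comes from the structure, not the model: even if the planner misclassifies, an unknown type raises instead of defaulting to an edit.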

What works better than I expected

Token budget enforcement in code, before every call. Small models have no concept of context limits. If you trust them to be brief, they will not be brief. Counting tokens in your own code and refusing to send a too-large request is the only way to actually stay under the limit.
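A minimal version of that guard, using the common ~4-characters-per-token heuristic (the heuristic and the function name are my assumptions, not the post's implementation):

```python
def enforce_token_budget(messages, max_tokens=4096, chars_per_token=4):
    """Estimate the token count of a request and refuse to send it
    when it exceeds the budget, instead of trusting the model.

    Returns the estimate so callers can log it.
    """
    estimated = sum(len(m) for m in messages) // chars_per_token
    if estimated > max_tokens:
        raise ValueError(f"Request ~{estimated} tokens exceeds budget of {max_tokens}")
    return estimated
```

A real tokenizer (e.g. the one shipped with the model) gives a tighter bound, but even this rough check catches the pathological cases before they hit the context limit.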

Per-file isolation. Sending one file at a time to the model is dramatically more reliable than sending two. Two files in the same call confuses small models surprisingly often. They mix up which fix goes where.

Synthesis-style memory. Storing what the model did last time as a one-sentence summary, not the full task list, gives enough context for the model to handle "undo" and "also add X" requests on the next turn. Doesn't need to be sophisticated.
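In code this really can be a few lines (a sketch; the class name and truncation length are mine):

```python
class TurnMemory:
    """Keeps only a one-sentence summary of the last action,
    not the full task list, to feed into the next turn's prompt."""

    def __init__(self):
        self.last_summary = None

    def record(self, summary: str) -> None:
        # Truncate so a verbose summary can't blow up the next prompt
        self.last_summary = summary[:200]

    def context(self) -> str:
        # Empty string on the first turn, so prompts can concatenate blindly
        return f"Previous action: {self.last_summary}" if self.last_summary else ""
```

That single line of context is what lets the model resolve "undo that" or "also add X" without replaying the whole history.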

What I'm still figuring out

Whether any local model under 7B is actually viable for an agent role, or if 7B is the practical floor. I haven't found a smaller model that doesn't fail at structured output frequently enough to be unusable. Curious if anyone has had luck with smaller fine-tunes specifically tuned for tool use or JSON output.

I open sourced the test harness if anyone wants to look or contribute: github.com/razvanneculai/litecode

Any help is highly appreciated, and I would love any kind of feedback.

As a disclaimer: yes, I used AI to reformat some of my text, because English is not my first language and I think the information is interesting enough that it might help someone out.

submitted by /u/BestSeaworthiness283