Lessons from building a coding agent for 8k context windows: token budgeting, parallel executors, and per-file isolation

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The article explains how to build an AI CLI coding agent that works within ~8k token context limits by avoiding whole-project context and instead using a role-based workflow (planner → executor(s) → orchestrator).
  • It recommends enforcing token budgeting in the program (via a canFit() check) and automatically falling back to a per-file line index to retrieve only relevant code sections when the full file would not fit.
  • It highlights parallel execution as the key to making small context windows practical: independent per-file edits can run concurrently while a pure-code dependency graph controls sequencing.
  • The author shares practical pitfalls encountered during implementation, such as question-style requests unintentionally overwriting files when the agent lacks read-only vs write semantics.
  • Overall, multi-file refactors are reframed from a “context window” problem into a scheduling and dependency-management problem.

Most AI coding tools (Cursor, Aider, Claude Code) assume you have a 200k-token model. If you're running local LLMs through Ollama or LM Studio, or hitting free-tier cloud APIs like Groq or OpenRouter, you've got around 8k tokens to work with. That doesn't fit a whole project and barely fits a single large file.

I spent the last few weeks building a CLI coding agent that's designed around the 8k constraint instead of fighting it. Wanted to share what I learned, because some of it surprised me.

The core insight: the LLM never needs to see your whole project.

Most agents try to stuff as much context as possible into a single call. With 8k tokens that's a non-starter. The approach that worked for me is splitting the work into roles:

  • A planner call that only sees a lightweight project map (Markdown summaries of each folder, ~300-500 tokens for the whole project) plus the user's request, and outputs a task list.
  • Executor calls that each see exactly one file plus one task. Never two files in the same call.
  • An orchestrator that's pure code, no LLM at all, which builds a dependency graph between tasks and decides what runs in parallel vs. sequentially.

This split means the LLM only ever reasons about a small, bounded amount of code at any one time. The planner doesn't need to see code at all (just file summaries), and the executor only sees one file. Multi-file refactors stop being a context-window problem and become a scheduling problem.
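For illustration, here's a minimal sketch of that role split in Python. The function names, the task schema, and the prompts are my own guesses for the sake of the example, not the repo's actual API:

```python
# Hypothetical sketch of the planner -> executor split (not the repo's real API).
import json

def plan(project_map: str, request: str, llm) -> list[dict]:
    """Planner call: sees only the small project map and the request, never code.
    Returns a task list; each task names exactly one file."""
    prompt = (
        "You are a planner. Given the project map and the request, output a JSON "
        "list of tasks, each with keys: id, file, instruction, depends_on.\n\n"
        f"PROJECT MAP:\n{project_map}\n\nREQUEST:\n{request}"
    )
    return json.loads(llm(prompt))

def execute_task(task: dict, file_text: str, llm) -> str:
    """Executor call: sees exactly one file plus one task, nothing else."""
    prompt = (
        "Apply the task to the file below and return the full updated file.\n"
        f"TASK: {task['instruction']}\n\nFILE ({task['file']}):\n{file_text}"
    )
    return llm(prompt)

# The orchestrator is pure code: it turns the planner's depends_on fields into a
# dependency graph and schedules the executor calls (see the scheduling sketch below).
```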

Token budgeting has to be enforced in code, not promised in a prompt.

Every LLM call goes through a canFit() check that measures: system prompt + reserved output tokens + memory + actual code. If the code doesn't fit, the agent automatically falls back to a per-file line index (generated once for files over ~150 lines) and pulls only the relevant section.

Concrete budget math for 8192 tokens:

  • System prompt + instructions: ~1000
  • Reserved for response: ~2000
  • Short-term memory (4 entries): ~360
  • Available for actual code: ~4800 (about 140-190 lines)
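As a rough illustration, the check could look something like this. The constant names, the estimate_tokens() heuristic, and the line-index lookup are mine; a real implementation would use the model's actual tokenizer:

```python
# Illustrative budget check for an 8192-token window (names are hypothetical).
CONTEXT_WINDOW = 8192
SYSTEM_BUDGET = 1000      # system prompt + instructions
RESPONSE_RESERVE = 2000   # reserved for the model's output
MEMORY_BUDGET = 360       # ~4 short-term memory entries
CODE_BUDGET = CONTEXT_WINDOW - SYSTEM_BUDGET - RESPONSE_RESERVE - MEMORY_BUDGET  # ~4832

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude ~4 chars/token stand-in for a real tokenizer

def can_fit(code: str) -> bool:
    return estimate_tokens(code) <= CODE_BUDGET

def load_code(path: str, task: str, line_index: dict[str, tuple[int, int]]) -> str:
    """Return the whole file if it fits, otherwise only an indexed section.
    line_index maps section names to (start, end) line ranges, generated once
    for files over ~150 lines."""
    lines = open(path).readlines()
    full = "".join(lines)
    if can_fit(full):
        return full
    # Fall back to the per-file line index: naive matching, purely illustrative.
    for name, (start, end) in line_index.items():
        if name in task:
            return "".join(lines[start - 1:end])
    return "".join(lines[:150])  # last resort: truncate to roughly the code budget
```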

Parallel execution is the speed multiplier that makes 8k usable.

Because each executor sees only one file, independent edits across files can run simultaneously. A 5-file refactor that would be slow if run sequentially completes in roughly the time of the longest single edit. The dependency graph (built in pure code from the planner's task list) decides which tasks have to wait for which.
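One way to implement that scheduling, as a wave-by-wave sketch of my own (assuming each task carries an id and a depends_on list; not necessarily how the repo does it):

```python
# Illustrative scheduler: run independent per-file edits concurrently, wave by
# wave, respecting each task's depends_on list.
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks: list[dict], run_one) -> None:
    done: set[str] = set()
    remaining = {t["id"]: t for t in tasks}
    with ThreadPoolExecutor() as pool:
        while remaining:
            # Everything whose dependencies are already finished can run now.
            ready = [t for t in remaining.values()
                     if all(dep in done for dep in t.get("depends_on", []))]
            if not ready:
                raise RuntimeError("dependency cycle in the task graph")
            # One executor call per file; independent files run in parallel,
            # so a wave takes roughly as long as its slowest edit.
            list(pool.map(run_one, ready))
            for t in ready:
                done.add(t["id"])
                del remaining[t["id"]]
```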

A few things that tripped me up along the way:

  • Question-style requests overwriting files. The first version had no concept of read-only operations, so asking "how many lines does X have?" caused the executor to write the answer into the file. Fixed by adding an action_type: "query" field to the planner's output, which routes those requests through a separate code path that never touches disk.
  • Stale project maps causing silent misroutes. If the user named a file in their request that wasn't in the context map (because they just renamed it, or hadn't refreshed), the planner would silently route the action to the closest match. Now the orchestrator validates that mentioned file paths actually exist on disk and throws a clear error if they don't.
  • Markdown fences in executor output. Even when explicitly told not to, smaller models love wrapping code in triple backticks. Strip them in post-processing rather than fighting the prompt (see the sketch after this list).
  • Memory token cost. Initially didn't budget for it; persistent memory is great but it's another ~80-90 tokens per entry that has to come out of the code budget. Now folder context is dropped first when the budget is tight, then memory, before the actual code gets cut.
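For the fences issue, a small post-processing step is enough. This is just a generic sketch of the idea, not the repo's code:

```python
import re

# Strip a single wrapping triple-backtick fence (with or without a language tag)
# that small models like to add, without touching backticks inside the file itself.
FENCE = re.compile(r"\A\s*`{3}[\w+-]*\n(.*?)\n?`{3}\s*\Z", re.DOTALL)

def strip_fences(output: str) -> str:
    match = FENCE.match(output)
    return match.group(1) if match else output
```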

What I'm still figuring out:

Whether the planner/executor split scales cleanly to codebases over 50 files. The dependency graph stays manageable, but the project map starts costing real tokens once you have enough folders. Currently dropping folder context first when budget is tight, but that means deeper edits get less context. Curious if anyone else has run into this and how they handle it.

Open-sourced the implementation if anyone wants to dig in: https://github.com/razvanneculai/litecode

submitted by /u/BestSeaworthiness283