I Spent Four Weeks Reading 200+ Sources on Context Engineering. Here's What I Built.

Dev.to / 4/9/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author describes how AI coding agents (Claude Code/Cursor/Copilot) kept failing in production despite extensive CLAUDE.md files, including forgetting rules, running wrong commands, and touching restricted files.
  • After reading 200+ sources on context engineering over four weeks, they conclude that poorly designed context not only fails to help but can actively reduce success and increase cost, supported by multiple studies and production data claims.
  • They synthesize “8 laws” of effective context engineering, emphasizing principles like “less is more,” using landmines to block unsafe actions, favoring command snippets over prose, and respecting finite context limits.
  • They propose a layered approach to context via progressive disclosure (root repo context → subdirectory context → skills → MCP tools) and “hooks for determinism” for rules that must be followed 100% of the time.
  • The post is a launch announcement for nv:context, a Claude Code skill intended to set up context engineering for any repository in about three minutes.

A launch post for nv:context, a Claude Code skill that sets up context engineering for any repository in three minutes.

The wall I kept hitting

I build production Python services with AI coding agents. Claude Code, Cursor, Copilot, the whole rotation. And no matter how carefully I wrote my CLAUDE.md files, I kept hitting the same wall: the agent would forget rules mid-session, run the wrong test command, or touch files it shouldn't.

I did what most people do. I wrote longer CLAUDE.md files. Added more "don't do X" instructions. Tried /init. Nothing clicked.

Eventually I sat down to figure out why. Four weeks later, I had read 200+ sources on what the research calls context engineering. The picture was clearer than I expected, and uglier. Here's the punchline:

Bad context doesn't just not help. It actively hurts.

  • ETH Zurich found that auto-generated agent config files reduce success rates by 3% and increase costs by 20%+
  • METR ran a controlled study on experienced developers and found they were 19% slower with AI tools when context was poorly managed, despite feeling 24% faster
  • FlowHunt / LongMemEval showed that a focused 300-token context outperforms an unfocused 113K-token context on the same task
  • Dex Horthy has shown that using 40% of the context window outperforms using 90%
  • Anthropic and Manus production data: below 60% context utilization is safe. At 70%, precision drops. At 85%, hallucinations begin.

The thing that shifted my thinking was Philipp Schmid's line:

"Most agent failures are not model failures. They are context failures."

The 8 laws that came out of it

  1. Less is more. Every line in your context competes with the actual task for attention.
  2. Landmines, not maps. Document what agents can't discover by reading the code.
  3. Commands beat prose. One snippet showing npm run test -- --coverage --maxWorkers=2 beats three paragraphs.
  4. Context is finite. Frontier LLMs follow roughly 150 to 200 instructions consistently.
  5. Progressive disclosure. Layer it: root CLAUDE.md, subdirectory CLAUDE.md, skills, MCP tools.
  6. Hooks for determinism. If a rule MUST be followed 100% of the time, use a hook.
  7. Negative instructions backfire. "Don't use moment.js" makes models more likely to use moment.js. Say "MUST use date-fns" instead.
  8. Compact proactively. Don't wait for Claude to compact at 95%. Update HANDOFF.md, run /clear, start fresh.
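Law 6 deserves a concrete sketch. In Claude Code, hooks are shell commands registered in .claude/settings.json that run around tool calls; a PreToolUse hook that exits with status 2 blocks the pending call and feeds its stderr back to the model. The guard below is a minimal illustration of that mechanism, not nv:context's actual hook, and the protected paths are hypothetical:

```python
#!/usr/bin/env python3
"""Minimal sketch of a Claude Code PreToolUse hook guard.

Claude Code passes the pending tool call as JSON on stdin.
Exiting with status 2 blocks the call deterministically.
Paths listed here are illustrative, not from nv:context.
"""
import json
import sys

# Files the agent must never touch, no matter what the prompt says.
PROTECTED = ("migrations/", ".env")


def should_block(event: dict) -> bool:
    """Return True if the tool call targets a protected path."""
    path = event.get("tool_input", {}).get("file_path", "")
    return any(p in path for p in PROTECTED)


def main() -> int:
    event = json.load(sys.stdin)
    if should_block(event):
        print("blocked: protected path", file=sys.stderr)
        return 2  # exit code 2 = hard block, 100% compliance
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the hook runs outside the model, compliance is enforcement rather than instruction-following, which is why the hierarchy below rates hooks at 100%.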

The hierarchy of leverage

Priority  Layer                    Compliance  Cost to set up
───────────────────────────────────────────────────
   1      Verification             100%        Medium
   2      CLAUDE.md / AGENTS.md    90-95%      Low
   3      Hooks                    100%        Low
   4      Skills                   ~79%        Medium
   5      Subagent patterns        Variable    Medium
   6      Session management       Manual      Low

Most people optimize from the bottom up. The best engineers start at the top.

What nv:context does

  1. Interviews you about your tools, pain points, landmines, and workflow preferences
  2. Scans your codebase with parallel subagents to find non-obvious patterns
  3. Scores your setup on all six leverage layers (0-10 per layer, 0-60 overall)
  4. Generates tailored configs for only the tools you actually use
  5. Sets up hooks for deterministic enforcement
  6. Creates session management infrastructure
  7. Installs compounding engineering (optional GitHub Action)

Production proof

selectools (Python SDK, 4,612 tests)

Starting state: L3 maturity, 49/60 leverage score, 440-line CLAUDE.md.

After: L5-L6, 58/60. CLAUDE.md went from 440 lines to 67 (-85%). Token budget dropped 53%.

nichevlabs (multi-product SaaS)

Starting state: L4 maturity, 17/60 leverage score.

The smoking gun: an 805-line SESSION.md that got loaded on every session start. 17,000 tokens. On every conversation. nv:context's token budget report made it impossible to ignore.

After: L6, 49/60 (up 32 points). SESSION.md went from 805 lines to 59 (-93%). Saved 15,800 tokens per session. A parallel bug-hunt subagent surfaced 81 real bugs while it was analyzing the codebase.

sheriff (Python + TypeScript)

Already-strong setup. L4 maturity, 36/60 leverage score going in.

After: L5, 42/60 (+6 points). Smaller delta than the others. Incremental polish, not a rewrite.

The through-line across all three: the skill is not a template generator. Same methodology, radically different outputs.

What it works with

The generated AGENTS.md is read by 25+ AI coding tools including Claude Code, Cursor, GitHub Copilot, Aider, Codeium, Continue, Windsurf, Zed, Gemini CLI, Cline. Tool-specific files only get generated for the tools you actually use.
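For a sense of what such a file contains, here is a minimal AGENTS.md applying laws 2 and 3 (landmines the agent can't discover, commands instead of prose). The commands and rules are illustrative, not output from the skill:

```markdown
# AGENTS.md

## Commands
- Test: npm run test -- --coverage --maxWorkers=2
- Lint: npm run lint

## Landmines
- MUST use date-fns for dates (moment.js is rejected at review)
- migrations/ is generated; regenerate it, never hand-edit
```

Note what is absent: nothing the agent could learn by reading the code, per law 1.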

Install

npx skills add johnnichev/nv-context -g -y

Then open any project and run /nv-context. Three-minute interview, thirty seconds of parallel analysis, done.

The research library

Full research library: https://skills.nichevlabs.com/research

Full synthesis (10 laws, 4 operations, 7-component context stack): https://skills.nichevlabs.com/synthesis

Primary sources include Anthropic engineering blog, Google DeepMind research, OpenAI Agents docs, ETH Zurich agent config paper, METR controlled developer study, JetBrains NeurIPS 2025 paper, Manus production data, GitHub's analysis of 2,500 public AGENTS.md files, Boris Cherny, and Dex Horthy.

Honest caveats

  • Three production repos is a small sample.
  • First runs carry roughly 60% token overhead, though the first-run benchmark showed a 100% pass rate vs a 45.8% baseline.
  • Research coverage is Python and JavaScript heavy. Rust, Go, Kotlin, and Elixir are thinner.
  • The skill is opinionated.

If you build AI coding agents for a living

Context engineering is the discipline that separates AI tools that work in demos from AI tools that work in production. If you have been writing ever-longer CLAUDE.md files and things still don't quite work, try nv:context on your repo.

Two things launch alongside nv:context today. First, selectools, the Python agent framework I built that taught me I needed a methodology. Second, the landing page for this methodology, built entirely with nv:design.