A few months ago I wrote about context engineering - the invisible logic that keeps AI agents from losing their minds during long sessions. I described the patterns from the outside: keep the latest file versions, trim terminal output, summarize old tool results, guard the system prompt.
I also made a prediction: naive LLM summarization was a band-aid. The real work had to be deterministic curation. Summarization should be the last resort.
Then Claude Code's repository surfaced publicly. I asked Claude to analyze its own compaction source code.
The prediction held. And the implementation is more thoughtful than I expected.
Three Tiers, Not One
Claude Code's compaction system isn't a single mechanism - it's three tiers applied in sequence, each heavier than the last.
Tier 1 runs before every API call. It does lightweight cleanup: it keeps only the five most recent tool results and replaces the older ones with [Old tool result content cleared]. Fast, cheap, no model involved.
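A minimal sketch of what a Tier 1-style pass can look like. The message shapes, constant names, and helper here are hypothetical illustrations of the technique, not Claude Code's actual internals:

```python
# Hypothetical Tier 1-style cleanup: purely deterministic, no model call.
# Keep the N most recent tool results; blank out the rest in place.
KEEP_RECENT = 5
PLACEHOLDER = "[Old tool result content cleared]"

def clear_old_tool_results(messages: list[dict]) -> list[dict]:
    # Indices of tool-result messages, oldest first.
    tool_idxs = [i for i, m in enumerate(messages) if m["type"] == "tool_result"]
    # Everything except the most recent KEEP_RECENT is stale.
    stale = set(tool_idxs[:-KEEP_RECENT]) if len(tool_idxs) > KEEP_RECENT else set()
    return [
        {**m, "content": PLACEHOLDER} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Because the pass is a pure function over the message list, it can run before every API call at essentially zero cost.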
Tier 2 operates at the API level - server-side strategies that handle thinking blocks and tool result clearing based on token thresholds.
Tier 3 is the full LLM summarization. A structured 9-section summary: intent, technical concepts, files touched, errors and fixes, all user messages, pending tasks, current work. The model reasons through the conversation before committing to the summary - a chain-of-thought scratchpad that gets stripped afterward. It's sophisticated. It's also the last resort.
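To make the shape of this concrete, here is a hedged sketch of how a sectioned summary prompt with a strippable scratchpad might be assembled. The section names come from the list above; the tag name, helper names, and wiring are illustrative assumptions, not Claude Code's actual code:

```python
import re

# Sections named in the article (seven of the nine are listed there).
SECTIONS = [
    "Primary intent", "Technical concepts", "Files touched",
    "Errors and fixes", "All user messages", "Pending tasks", "Current work",
]

def build_summary_prompt(transcript: str) -> str:
    # Ask the model to reason first in a scratchpad, then emit the summary.
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(SECTIONS))
    return (
        "First, reason about the conversation inside <analysis> tags "
        "(this scratchpad will be discarded).\n"
        f"Then write a summary with these sections:\n{numbered}\n\n"
        f"Conversation:\n{transcript}"
    )

def strip_scratchpad(model_output: str) -> str:
    # Drop the chain-of-thought block; keep only the structured summary.
    return re.sub(r"<analysis>.*?</analysis>\s*", "", model_output, flags=re.DOTALL)
```

The point of the scratchpad is that the model gets to think at full length, but only the compact structured summary survives into the compacted context.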
This architecture confirms exactly what the first article argued: summarization is expensive and lossy. You reach for it only when everything else has already run.
But Here's the Problem
My first instinct when reading about Tier 1 was: if the conversation is cached, deleting old messages invalidates the cache. And cache invalidation is brutally expensive - instead of a 90% discount on tokens, you're paying 1.25x for cache writes. You've just made compaction cost more than the tokens you saved.
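The arithmetic is easy to sketch. Using the multipliers above (roughly a 90% discount on cache reads versus 1.25x for cache writes), a miss on a long cached prefix costs 12.5x more than a hit. The base rate below is an example figure, not a quote of any specific model's pricing:

```python
# Back-of-envelope cache economics. Multipliers match the article's
# framing: cache reads at ~0.1x base input price, cache writes at 1.25x.
BASE = 3.00                       # $ per million input tokens (example rate)
READ_MULT, WRITE_MULT = 0.10, 1.25

def call_cost(cached_tokens: int, fresh_tokens: int) -> float:
    return (cached_tokens * READ_MULT + fresh_tokens * WRITE_MULT) * BASE / 1e6

prefix = 150_000                  # tokens of cached conversation prefix
hit = call_cost(prefix, 0)        # whole prefix read at the discounted rate
miss = call_cost(0, prefix)       # deletion invalidated it: re-written at 1.25x
print(f"hit: ${hit:.4f}  miss: ${miss:.4f}  ratio: {miss / hit:.1f}x")
```

So any compaction step that invalidates the cached prefix has to save well over an order of magnitude more tokens than it touches just to break even.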
So how does Claude Code solve this? The answer involves a mechanism called `cache_edits` that surgically removes tool results without touching the cached prefix, a summarization call that piggybacks on the main conversation's cache key (the alternative showed a 98% miss rate), and a reconstruction process that rebuilds the entire session state after compaction.
Read the full analysis on my blog →
The full post covers:
- How `cache_edits` preserves the prompt cache during cleanup
- Why the summarization call reuses your own cache key (and what happens when it doesn't)
- The complete post-compaction reconstruction process
- How cache economics shaped every architectural decision in the system