A few months ago I wrote about context engineering - the invisible logic that keeps AI agents from losing their minds during long sessions. I described the patterns from the outside: keep the latest file versions, trim terminal output, summarize old tool results, guard the system prompt.
I also made a prediction: naive LLM summarization was a band-aid. The real work had to be deterministic curation. Summarization should be the last resort.
Then Claude Code's repository surfaced publicly. I asked Claude to analyze its own compaction source code.
The prediction held. And the implementation is more thoughtful than I expected.
Three Tiers, Not One
Claude Code's compaction system isn't a single mechanism - it's three tiers applied in sequence, each heavier than the last.
Tier 1 runs before every API call. It does lightweight cleanup: it keeps only the five most recent tool results and replaces the older ones with [Old tool result content cleared]. Fast, cheap, no model involved.
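A minimal sketch of what a Tier 1-style pass can look like. The message shapes, constant names, and helper here are hypothetical illustrations of the technique, not Claude Code's actual internals:

```python
# Hypothetical Tier 1-style cleanup: purely deterministic, no model call.
# Keep the N most recent tool results; blank out the rest in place.
KEEP_RECENT = 5
PLACEHOLDER = "[Old tool result content cleared]"

def clear_old_tool_results(messages: list[dict]) -> list[dict]:
    # Indices of tool-result messages, oldest first.
    tool_idxs = [i for i, m in enumerate(messages) if m["type"] == "tool_result"]
    # Everything except the most recent KEEP_RECENT is stale.
    stale = set(tool_idxs[:-KEEP_RECENT]) if len(tool_idxs) > KEEP_RECENT else set()
    return [
        {**m, "content": PLACEHOLDER} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Because the pass is a pure function over the message list, it can run before every API call at essentially zero cost.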
Tier 2 operates at the API level - server-side strategies that handle thinking blocks and tool result clearing based on token thresholds.
Tier 3 is the full LLM summarization. A structured 9-section summary: intent, technical concepts, files touched, errors and fixes, all user messages, pending tasks, current work. The model reasons through the conversation before committing to the summary - a chain-of-thought scratchpad that gets stripped afterward. It's sophisticated. It's also the last resort.
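To make the shape of this concrete, here is a hedged sketch of how a sectioned summary prompt with a strippable scratchpad might be assembled. The section names come from the list above; the tag name, helper names, and wiring are illustrative assumptions, not Claude Code's actual code:

```python
import re

# Sections named in the article (seven of the nine are listed there).
SECTIONS = [
    "Primary intent", "Technical concepts", "Files touched",
    "Errors and fixes", "All user messages", "Pending tasks", "Current work",
]

def build_summary_prompt(transcript: str) -> str:
    # Ask the model to reason first in a scratchpad, then emit the summary.
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(SECTIONS))
    return (
        "First, reason about the conversation inside <analysis> tags "
        "(this scratchpad will be discarded).\n"
        f"Then write a summary with these sections:\n{numbered}\n\n"
        f"Conversation:\n{transcript}"
    )

def strip_scratchpad(model_output: str) -> str:
    # Drop the chain-of-thought block; keep only the structured summary.
    return re.sub(r"<analysis>.*?</analysis>\s*", "", model_output, flags=re.DOTALL)
```

The point of the scratchpad is that the model gets to think at full length, but only the compact structured summary survives into the compacted context.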
This architecture confirms exactly what the first article argued: summarization is expensive and lossy. You reach for it only when everything else has already run.
But Here's the Problem
My first instinct when reading about Tier 1 was: if the conversation is cached, deleting old messages invalidates the cache. And cache invalidation is brutally expensive - instead of a 90% discount on tokens, you're paying 1.25x for cache writes. You've just made compaction cost more than the tokens you saved.
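The arithmetic is easy to sketch. Using the multipliers above (roughly a 90% discount on cache reads versus 1.25x for cache writes), a miss on a long cached prefix costs 12.5x more than a hit. The base rate below is an example figure, not a quote of any specific model's pricing:

```python
# Back-of-envelope cache economics. Multipliers match the article's
# framing: cache reads at ~0.1x base input price, cache writes at 1.25x.
BASE = 3.00                       # $ per million input tokens (example rate)
READ_MULT, WRITE_MULT = 0.10, 1.25

def call_cost(cached_tokens: int, fresh_tokens: int) -> float:
    return (cached_tokens * READ_MULT + fresh_tokens * WRITE_MULT) * BASE / 1e6

prefix = 150_000                  # tokens of cached conversation prefix
hit = call_cost(prefix, 0)        # whole prefix read at the discounted rate
miss = call_cost(0, prefix)       # deletion invalidated it: re-written at 1.25x
print(f"hit: ${hit:.4f}  miss: ${miss:.4f}  ratio: {miss / hit:.1f}x")
```

So any compaction step that invalidates the cached prefix has to save well over an order of magnitude more tokens than it touches just to break even.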
So how does Claude Code solve this? The answer involves a mechanism called `cache_edits` that surgically removes tool results without touching the cached prefix, a summarization call that piggybacks on the main conversation's cache key (the alternative showed a 98% miss rate), and a reconstruction process that rebuilds the entire session state after compaction.
Read the full analysis on my blog →
The full post covers:
- How `cache_edits` preserves the prompt cache during cleanup
- Why the summarization call reuses your own cache key (and what happens when it doesn't)
- The complete post-compaction reconstruction process
- How cache economics shaped every architectural decision in the system