Claude Code cache confusion as Anthropic tweaks defaults, but quotas still drain
Dev reports suggest long sessions now burn through usage much faster
Anthropic last month reduced the TTL (time to live) of the Claude Code prompt cache from one hour to five minutes for many requests, but said the change should not increase costs, despite users reporting faster-depleting quotas.
User Sean Swanson posted a bug report showing that Anthropic introduced a one-hour cache for Claude Code context around February 1, then changed it back to a five-minute cache around March 7. "The 5m TTL is disproportionately punishing for the long-session, high-context use case that defines Claude Code usage," said Swanson.
When using AI coding assistants or agents, the context is additional data sent along with the user's prompts, such as existing code or background instructions. Context improves the accuracy of the AI but also requires more processing.
Claude prompt caching avoids re-processing previously used prompts including context and background information. The cache can have either a five-minute or one-hour TTL. Writing to the five-minute cache costs 25 percent more in tokens, and writing to the one-hour cache 100 percent more, but reading from cache is around 10 percent of the base price.
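The trade-off between the two TTLs can be sketched with a back-of-envelope cost model. The multipliers below come from the figures above; the session shapes (turn counts and gaps) are hypothetical, chosen only to contrast a rapid-fire subagent session with a long session that has pauses between turns:

```python
# Assumed cost multipliers, relative to base input-token price,
# per the pricing described in the article:
FIVE_MIN_WRITE = 1.25   # writing to the 5-minute cache: +25%
ONE_HOUR_WRITE = 2.00   # writing to the 1-hour cache: +100%
CACHE_READ = 0.10       # reading from cache: ~10% of base price

def session_cost(turns: int, gap_minutes: float, ttl_minutes: float,
                 write_mult: float) -> float:
    """Relative session cost, in units of 'one full context at base price'.
    Each turn re-reads the cached context if the cache is still warm;
    if the gap between turns exceeds the TTL, it pays the write cost again."""
    cost = write_mult  # first turn always writes the cache
    for _ in range(turns - 1):
        if gap_minutes <= ttl_minutes:
            cost += CACHE_READ   # warm cache: cheap read
        else:
            cost += write_mult   # expired cache: full re-write
    return cost

# Rapid-fire subagent session: 20 turns, 1 minute apart.
fast_5m = session_cost(20, 1, 5, FIVE_MIN_WRITE)    # 1.25 + 19*0.1 = 3.15
fast_1h = session_cost(20, 1, 60, ONE_HOUR_WRITE)   # 2.00 + 19*0.1 = 3.90

# Long session with pauses: 20 turns, 10 minutes apart.
slow_5m = session_cost(20, 10, 5, FIVE_MIN_WRITE)   # 20 * 1.25 = 25.00
slow_1h = session_cost(20, 10, 60, ONE_HOUR_WRITE)  # 2.00 + 19*0.1 = 3.90
```

Under these assumptions the five-minute cache is slightly cheaper for quick back-to-back requests, as Sumner argues, but roughly six times more expensive for a session whose turns outlast the TTL, which is the pattern Swanson describes.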
Jarred Sumner, the creator of the Bun JavaScript runtime who now works for Anthropic, agreed that the analysis was "good detective work" but claimed that the change back to the five-minute cache made Claude Code cheaper because "a meaningful share of Claude Code's requests are one-shot calls where the cached context is used once and not revisited." Sumner said that the Claude Code client determines the cache TTL automatically and there are no plans for a global setting.
Swanson revised his analysis in response, agreeing that sessions using subagents do benefit from the lower write cost of the five-minute cache since they interact quickly and "their caches almost never expire." However, he said he has been a $200 per month subscriber for over six months and had never hit a quota limit until March. The "extra burn rate" is "making a once great service unusable," he said.
Another factor is that the large one-million-token context window available on paid plans with the Claude Opus 4.6 or Sonnet 4.6 models increases costs, especially on cache misses. Claude Code creator Boris Cherny said that "prompt cache misses when using 1M token context window are expensive... if you leave your computer for over an hour then continue a stale session, it's often a full cache miss." He said Anthropic is considering making a 400,000-token context window the default, with an option to enable the full one million tokens; there is already a configuration setting for this.
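Some rough arithmetic shows why a stale 1M-token session stings. The multipliers are the ones quoted earlier; the context sizes follow Cherny's comments, and the costs are in abstract token-price units rather than real dollars:

```python
# Assumed multipliers vs. base input-token price (from the article):
CACHE_READ = 0.10       # warm cache hit
FIVE_MIN_WRITE = 1.25   # full re-write after a cache miss

def request_cost(context_tokens: int, cache_hit: bool) -> float:
    """Relative cost of one request, in base-price token units."""
    mult = CACHE_READ if cache_hit else FIVE_MIN_WRITE
    return context_tokens * mult

hit_1m = request_cost(1_000_000, True)      #   100,000 units
miss_1m = request_cost(1_000_000, False)    # 1,250,000 units
miss_400k = request_cost(400_000, False)    #   500,000 units
```

On these assumptions, resuming a stale 1M-token session costs about 12.5 times a warm one, and a full miss at the proposed 400,000-token default is 2.5 times cheaper than one at a million tokens.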
Cherny said that larger contexts are now common because users are "pulling in a large number of skills, or running many agents or background automations."
Some developers are convinced that cache rebuilding and cache misses are major factors in Claude Code quota exhaustion, which has reached the point where Pro users ($20 per month) may get as few as two prompts in five hours. A number of bugs in the caching code have also been reported, leading one user to remark: "Before those are fixed likely any 5 minutes vs 1 h discussion is entirely moot since numbers are totally flawed."
The focus on cache optimization may also be evidence that, under the covers, Anthropic's quotas simply buy less processing time than they used to.
Swanson is not alone in reporting that Claude's performance has dropped. For example, a user on the enterprise team plan said: "In March I could use Opus all day and it was getting great results. Since the last week of March and into April, I've had sessions where I maxed out session usage under 2 hours and it got stuck in overthinking loops, multiple turns of realising the same thing, dozens of paragraphs of 'but wait, actually I need to do x' with slight variations." That chimes with similar comments from an AI director at AMD.
Cache optimization may be important, but it seems unlikely to account for all these reported issues. ®