I don't see any recent threads on this topic, so I'm posting this. As the title says, the KV cache takes too much memory, sometimes even more than the model's own size at long context (see the images for an example). In recent months we've been getting models that support up to 256K context at the base level and then extend it to 1 million using YaRN. Recent models like Qwen3-Next and the Qwen3.5 series hold up better at longer context without losing much speed compared to other models. For the models themselves, at least we have pruning; I don't remember anything recent on the KV cache side (probably I'm just unaware of such solutions, please share if any). Even for an 8B model, 256K context requires 40-55GB of memory (model: 8GB + KV cache: 32-45GB). I see most people here use at least 128K context for agentic coding, writing, etc. I think 128-256K context is not that big anymore in 2026. So, any upcoming solutions? Any ongoing PRs? Is DeepSeek possibly working in this area for their upcoming models?
KVCache taking too much Memory. Any solutions(Optimizations, Compressions, etc.,) coming soon/later?
Reddit r/LocalLLaMA / 3/23/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- A community discussion notes that KV cache memory usage for very long-context LLMs (e.g., 256K tokens natively, extended to ~1M with RoPE-scaling techniques like YaRN) can become extremely large, sometimes exceeding the memory footprint of the model weights themselves during long runs.
- The post highlights practical numbers for an 8B model where 256K context can require roughly 40–55GB total VRAM (model ~8GB plus KV cache ~32–45GB), motivating most users to prefer shorter contexts like 128K.
- While the writer notes that model-side mitigations such as pruning exist, they report not seeing recent or widely adopted KV-cache-specific optimization/compression techniques being discussed, in contrast to the steady stream of model-side context-length scaling improvements.
- The thread asks whether upcoming solutions (optimizations, compressions, or related work—potentially from teams like DeepSeek) are expected to reduce KV cache memory growth for long-context inference.
- Overall, it signals a growing engineering constraint as local/agentic workloads push context length upward, making memory efficiency for KV cache a key near-term problem to solve.
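The memory figures in the discussion can be sanity-checked with standard KV cache arithmetic: two tensors (K and V) per layer, each of size `kv_heads × head_dim` per token. A minimal sketch, assuming illustrative architecture values typical of an 8B-class GQA model (36 layers, 8 KV heads, head dimension 128; these numbers are not from the thread):

```python
# Back-of-envelope KV cache sizing for a long-context LLM.
# Architecture values are illustrative assumptions, not quoted from the post.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: float) -> int:
    """Total KV cache size: K and V tensors (hence the factor 2) per layer,
    each holding ctx_len vectors of n_kv_heads * head_dim elements."""
    return int(2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem)

GIB = 1024 ** 3
CTX = 256 * 1024  # 256K tokens

for label, bpe in [("fp16 cache", 2), ("8-bit cache", 1), ("4-bit cache", 0.5)]:
    size = kv_cache_bytes(36, 8, 128, CTX, bpe)
    print(f"{label}: {size / GIB:.1f} GiB")
```

With these assumed dimensions, an fp16 cache at 256K tokens lands at 36 GiB (~38.7 GB), squarely inside the 32-45GB range cited in the post, and the same arithmetic shows why cache quantization (halving or quartering bytes per element) is one of the more direct memory levers being asked about.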
Related Articles
The Moonwell Oracle Exploit: How AI-Assisted 'Vibe Coding' Turned cbETH Into a $1.12 Token and Cost $1.78M
Dev.to
How CVE-2026-25253 exposed every OpenClaw user to RCE — and how to fix it in one command
Dev.to
Day 10: An AI Agent's Revenue Report — $29, 25 Products, 160 Tweets
Dev.to
What CVE-2026-25253 Taught Me About Building Safe AI Assistants
Dev.to
Vision and Hardware Strategy Shaping the Future of AI: From Apple to AGI and AI Chips
Dev.to