DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
arXiv cs.AI / 4/30/2026
Key Points
- The paper addresses a major bottleneck in edge LLM inference: KV-cache sizes can exceed limited device memory, making offloading necessary but challenging.
- DUAL-BLADE introduces a dual-path KV residency mechanism that routes KV tensors to either a kernel page-cache-backed path or an NVMe-direct path depending on real-time memory availability (see the routing sketch after this list).
- The NVMe-direct design bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, reducing thrashing, software overhead, and latency unpredictability (a raw-device read sketch follows below).
- By adding adaptive pipeline parallelism to overlap storage I/O with GPU DMA, DUAL-BLADE increases inference throughput (a double-buffering sketch follows below).
- Experiments report up to 33.1% lower prefill latency and 42.4% lower decode latency, alongside a 2.2x improvement in SSD utilization under varying memory budgets.
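The paper's routing code is not reproduced here, but a minimal C++ sketch of such a dual-path policy is straightforward: probe free memory and fall back to the NVMe-direct path whenever admitting another KV block would squeeze the page cache below a headroom threshold. The `KvPath` enum, the `free_memory_bytes()` probe, and the watermark values are illustrative assumptions, not DUAL-BLADE's actual interface.

```cpp
#include <cstddef>
#include <cstdio>
#include <sys/sysinfo.h>  // Linux-only: sysinfo() as a coarse free-memory probe

// Two residency paths, mirroring the paper's split between a kernel
// page-cache-backed file path and an NVMe-direct (raw LBA) path.
enum class KvPath { PageCache, NvmeDirect };

// Hypothetical probe: currently available RAM in bytes.
static size_t free_memory_bytes() {
    struct sysinfo si {};
    if (sysinfo(&si) != 0) return 0;
    return static_cast<size_t>(si.freeram) * si.mem_unit;
}

// Illustrative policy: keep a KV block on the page-cache path only if
// admitting it still leaves at least `low_watermark` bytes of headroom;
// otherwise route it down the NVMe-direct path to avoid cache thrashing.
KvPath choose_path(size_t kv_block_bytes, size_t low_watermark) {
    size_t avail = free_memory_bytes();
    if (avail > kv_block_bytes && avail - kv_block_bytes >= low_watermark)
        return KvPath::PageCache;
    return KvPath::NvmeDirect;
}

int main() {
    constexpr size_t kBlock = 64ull << 20;         // 64 MiB KV block (example size)
    constexpr size_t kLowWatermark = 512ull << 20; // keep 512 MiB headroom (assumed)
    KvPath p = choose_path(kBlock, kLowWatermark);
    std::printf("route: %s\n", p == KvPath::PageCache ? "page-cache" : "nvme-direct");
    return 0;
}
```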
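On Linux, bypassing the filesystem usually means opening the NVMe namespace as a raw block device with O_DIRECT and addressing data by LBA. The sketch below shows a generic aligned read at a fixed LBA offset; the device path, the 512-byte sector size, and the block placement are assumptions for illustration and are not taken from the paper.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // expose O_DIRECT in <fcntl.h> on glibc
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// O_DIRECT bypasses the kernel page cache, so buffer address, file offset,
// and length must all be aligned to the device's logical block size
// (512 bytes assumed here).
static constexpr size_t kSector = 512;

// Read one KV block from a contiguous LBA region of a raw NVMe device.
bool read_kv_block(const char* dev, uint64_t start_lba, void* buf, size_t len) {
    int fd = open(dev, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return false; }
    off_t off = static_cast<off_t>(start_lba * kSector);
    ssize_t n = pread(fd, buf, len, off);  // one contiguous read, no filesystem indirection
    close(fd);
    return n == static_cast<ssize_t>(len);
}

int main() {
    const size_t len = 1 << 20;  // 1 MiB KV block (example)
    void* buf = nullptr;
    if (posix_memalign(&buf, kSector, len) != 0) return 1;  // O_DIRECT needs aligned memory
    // /dev/nvme0n1 and LBA 2048 are placeholders for a reserved raw region.
    bool ok = read_kv_block("/dev/nvme0n1", 2048, buf, len);
    std::printf("read %s\n", ok ? "ok" : "failed");
    free(buf);
    return 0;
}
```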
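Overlapping SSD reads with host-to-GPU copies is classically done with double buffering: while block i is being transferred to the GPU, block i+1 is fetched from storage. The sketch below emulates that structure with std::async; the stage functions are placeholders for real asynchronous I/O (for example io_uring) and DMA (for example cudaMemcpyAsync on a dedicated stream), and are not the paper's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <future>
#include <vector>

// Stand-in stages: in a real pipeline, load_from_ssd would issue an
// O_DIRECT or io_uring read, and copy_to_gpu would be an asynchronous DMA.
void load_from_ssd(int block, std::vector<char>& buf) {
    std::fill(buf.begin(), buf.end(), static_cast<char>(block));  // fake "read"
}
void copy_to_gpu(int block, const std::vector<char>& buf) {
    std::printf("block %d staged to GPU (%zu bytes)\n", block, buf.size());
}

int main() {
    constexpr int kBlocks = 8;               // number of KV blocks (example)
    constexpr size_t kBlockBytes = 1 << 20;  // 1 MiB per block (example)
    std::vector<char> bufs[2] = {std::vector<char>(kBlockBytes),
                                 std::vector<char>(kBlockBytes)};

    // Prime the pipeline with block 0, then overlap read(i+1) with copy(i).
    load_from_ssd(0, bufs[0]);
    for (int i = 0; i < kBlocks; ++i) {
        std::future<void> next;
        if (i + 1 < kBlocks)
            next = std::async(std::launch::async, load_from_ssd, i + 1,
                              std::ref(bufs[(i + 1) % 2]));
        copy_to_gpu(i, bufs[i % 2]);  // "GPU transfer" of the current block
        if (next.valid()) next.get(); // next block is in place before the next copy
    }
    return 0;
}
```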