DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

arXiv cs.AI · April 30, 2026


Key Points

  • The paper addresses a major bottleneck in edge LLM inference: KV-cache sizes can exceed limited device memory, making offloading necessary but challenging.
  • DUAL-BLADE introduces a dual-path KV residency mechanism that routes KV tensors to either a kernel page-cache-backed path or an NVMe-direct path depending on real-time memory availability (see the sketch after this list).
  • The NVMe-direct design bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, reducing thrashing, software overhead, and latency unpredictability.
  • By adding adaptive pipeline parallelism to overlap storage I/O with GPU DMA, DUAL-BLADE increases inference throughput.
  • Experiments report up to 33.1% lower prefill latency and 42.4% lower decode latency, alongside a 2.2x improvement in SSD utilization under varying memory budgets.
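
To make the routing idea concrete, here is a minimal sketch of a per-block residency decision, assuming a Linux host. The `MEM_LOW_WATERMARK` threshold, the function names, and the use of a plain preallocated file in place of the paper's raw contiguous-LBA region are all illustrative assumptions, not DUAL-BLADE's actual implementation.

```python
import mmap
import os

BLOCK = 4096             # O_DIRECT requires block-aligned offsets/lengths
MEM_LOW_WATERMARK = 0.2  # assumed policy knob, not a value from the paper


def available_mem_fraction() -> float:
    """MemAvailable / MemTotal parsed from /proc/meminfo (Linux only)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, val = line.split(":")
            info[key.strip()] = int(val.split()[0])
    return info["MemAvailable"] / info["MemTotal"]


def write_kv_block(path: str, offset: int, data: bytes) -> None:
    """Route one KV block to the buffered or the direct path.

    Assumes `path` is a preallocated file standing in for the paper's
    contiguous-LBA region on the NVMe device.
    """
    if available_mem_fraction() > MEM_LOW_WATERMARK:
        # Page-cache path: ordinary buffered write; the kernel keeps a
        # cached copy, which is cheap while memory is plentiful.
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(data)
    else:
        # NVMe-direct path: O_DIRECT bypasses the page cache entirely.
        # Offset, length, and the user buffer must all be block-aligned;
        # an anonymous mmap provides a page-aligned buffer.
        assert offset % BLOCK == 0 and len(data) % BLOCK == 0
        buf = mmap.mmap(-1, len(data))
        buf.write(data)
        fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
        try:
            os.pwrite(fd, buf, offset)
        finally:
            os.close(fd)
            buf.close()
```

Under a policy like this, each block's placement is decided at write time, so a workload that drifts into memory pressure naturally shifts new KV traffic onto the direct path instead of thrashing the page cache.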

Abstract

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalable capacity, existing file-based designs rely heavily on the kernel page cache, leading to cache thrashing, unpredictable latency, and high software overhead under memory pressure. We present DUAL-BLADE, a dual-path KV residency framework that dynamically assigns KV tensors to either a page-cache path or an NVMe-direct path based on runtime memory availability. The NVMe-direct path bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, enabling low-overhead direct storage access. DUAL-BLADE further incorporates adaptive pipeline parallelism to overlap storage I/O with GPU DMA, improving inference throughput. Our evaluation shows that DUAL-BLADE substantially mitigates I/O bottlenecks, reducing prefill and decode latency by up to 33.1% and 42.4%, respectively, while improving SSD utilization by 2.2x across diverse memory budgets.
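
The overlap between storage I/O and GPU-side work can be illustrated with a simple double-buffering loop: while block i is consumed, a worker thread prefetches block i+1. This is only a host-side approximation of the paper's adaptive pipeline parallelism; `consume_on_gpu` is a hypothetical placeholder for the host-to-device DMA and the attention compute that follows it.

```python
from concurrent.futures import ThreadPoolExecutor


def read_block(path: str, offset: int, size: int) -> bytes:
    """Blocking storage read for one KV block."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def consume_on_gpu(block: bytes) -> None:
    """Placeholder for host-to-device DMA plus the compute that uses it."""
    pass


def stream_kv(path: str, num_blocks: int, block_size: int) -> None:
    """Double-buffered streaming: the read of block i+1 overlaps use of block i."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_block, path, 0, block_size)
        for i in range(num_blocks):
            block = pending.result()  # wait for block i
            if i + 1 < num_blocks:
                # Kick off the next read before consuming, so the storage
                # read runs concurrently with the GPU-side work below.
                pending = pool.submit(
                    read_block, path, (i + 1) * block_size, block_size)
            consume_on_gpu(block)
```

With only one buffer in flight this hides at most one read latency per step; the paper's adaptive scheme presumably varies the pipeline depth with the memory budget, which a sketch this small does not capture.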