DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

arXiv cs.AI · April 30, 2026


Key Points

  • The paper addresses a major bottleneck in edge LLM inference: KV-cache sizes can exceed limited device memory, making offloading necessary but challenging.
  • DUAL-BLADE introduces a dual-path KV residency mechanism that routes KV tensors to either a kernel page-cache-backed path or an NVMe-direct path depending on real-time memory availability (see the sketch after this list).
  • The NVMe-direct design bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, reducing thrashing, software overhead, and latency unpredictability.
  • By adding adaptive pipeline parallelism to overlap storage I/O with GPU DMA, DUAL-BLADE increases inference throughput.
  • Experiments report up to 33.1% lower prefill latency and 42.4% lower decode latency, alongside a 2.2x improvement in SSD utilization under varying memory budgets.
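
To make the routing idea concrete, here is a minimal sketch of a per-block residency decision, assuming a Linux host. The `MEM_LOW_WATERMARK` threshold, the function names, and the use of a plain preallocated file in place of the paper's raw contiguous-LBA region are all illustrative assumptions, not DUAL-BLADE's actual implementation.

```python
import mmap
import os

BLOCK = 4096             # O_DIRECT requires block-aligned offsets/lengths
MEM_LOW_WATERMARK = 0.2  # assumed policy knob, not a value from the paper


def available_mem_fraction() -> float:
    """MemAvailable / MemTotal parsed from /proc/meminfo (Linux only)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, val = line.split(":")
            info[key.strip()] = int(val.split()[0])
    return info["MemAvailable"] / info["MemTotal"]


def write_kv_block(path: str, offset: int, data: bytes) -> None:
    """Route one KV block to the buffered or the direct path.

    Assumes `path` is a preallocated file standing in for the paper's
    contiguous-LBA region on the NVMe device.
    """
    if available_mem_fraction() > MEM_LOW_WATERMARK:
        # Page-cache path: ordinary buffered write; the kernel keeps a
        # cached copy, which is cheap while memory is plentiful.
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(data)
    else:
        # NVMe-direct path: O_DIRECT bypasses the page cache entirely.
        # Offset, length, and the user buffer must all be block-aligned;
        # an anonymous mmap provides a page-aligned buffer.
        assert offset % BLOCK == 0 and len(data) % BLOCK == 0
        buf = mmap.mmap(-1, len(data))
        buf.write(data)
        fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
        try:
            os.pwrite(fd, buf, offset)
        finally:
            os.close(fd)
            buf.close()
```

Under a policy like this, each block's placement is decided at write time, so a workload that drifts into memory pressure naturally shifts new KV traffic onto the direct path instead of thrashing the page cache.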

Abstract

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which often exceed available device memory. Although NVMe-based offloading offers scalable capacity, existing file-based designs rely heavily on the kernel page cache, leading to cache thrashing, unpredictable latency, and high software overhead under memory pressure. We present DUAL-BLADE, a dual-path KV residency framework that dynamically assigns KV tensors to either a page-cache path or an NVMe-direct path based on runtime memory availability. The NVMe-direct path bypasses the filesystem by mapping KV tensors to contiguous logical block address (LBA) regions, enabling low-overhead direct storage access. DUAL-BLADE further incorporates adaptive pipeline parallelism to overlap storage I/O with GPU DMA, improving inference throughput. Our evaluation shows that DUAL-BLADE substantially mitigates I/O bottlenecks, reducing prefill and decode latency by up to 33.1% and 42.4%, respectively, while improving SSD utilization by 2.2x across diverse memory budgets.
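
The overlap between storage I/O and GPU-side work can be illustrated with a simple double-buffering loop: while block i is consumed, a worker thread prefetches block i+1. This is only a host-side approximation of the paper's adaptive pipeline parallelism; `consume_on_gpu` is a hypothetical placeholder for the host-to-device DMA and the attention compute that follows it.

```python
from concurrent.futures import ThreadPoolExecutor


def read_block(path: str, offset: int, size: int) -> bytes:
    """Blocking storage read for one KV block."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def consume_on_gpu(block: bytes) -> None:
    """Placeholder for host-to-device DMA plus the compute that uses it."""
    pass


def stream_kv(path: str, num_blocks: int, block_size: int) -> None:
    """Double-buffered streaming: the read of block i+1 overlaps use of block i."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_block, path, 0, block_size)
        for i in range(num_blocks):
            block = pending.result()  # wait for block i
            if i + 1 < num_blocks:
                # Kick off the next read before consuming, so the storage
                # read runs concurrently with the GPU-side work below.
                pending = pool.submit(
                    read_block, path, (i + 1) * block_size, block_size)
            consume_on_gpu(block)
```

With only one buffer in flight this hides at most one read latency per step; the paper's adaptive scheme presumably varies the pipeline depth with the memory budget, which a sketch this small does not capture.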