The DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths

arXiv cs.AI / 3/12/2026



Key Points

dmaplane is a Linux kernel module that exposes a stable UAPI at /dev/dmaplane to explicitly manage DMA buffer lifecycles and orchestration for AI data paths.
It provides ring-based command channels, DMA buffer lifecycle management, dma-buf export for cross-device sharing, a kernel RDMA engine, NUMA-aware allocation/verification, credit-based flow control, low-overhead observability, and GPU memory integration via PCIe BAR pinning.
The paper evaluates orchestration sensitivity by measuring NUMA cross-node penalties at DRAM scale, completion-safe flow control under sustained RDMA load, and GPU BAR mapping tiers versus cudaMemcpy.
It also demonstrates end-to-end disaggregated inference by transferring KV-cache chunks between two machines via RDMA WRITE WITH IMMEDIATE and reconstructing tensor views on the receiver. RDMA measurements use Soft-RoCE.
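To make the last point concrete, here is a minimal Python sketch of how a receiver might rebuild a tensor view from a chunk that arrives with a 32-bit immediate value. It is not the paper's implementation: the chunk dimensions, the layout of the immediate value (layer in the high 16 bits, chunk index in the low 16 bits), and every function name are assumptions made for illustration, and the RDMA write itself is simulated with a plain memory copy.

```python
import struct

# Hypothetical layout constants; none of these come from the paper.
TOKENS_PER_CHUNK = 4
HEAD_DIM = 8
FLOAT_SIZE = 4
CHUNK_BYTES = TOKENS_PER_CHUNK * HEAD_DIM * FLOAT_SIZE
CHUNKS_PER_LAYER = 2
NUM_LAYERS = 2


def encode_imm(layer, chunk):
    """Pack (layer, chunk) into one 32-bit immediate value (assumed layout)."""
    if not (0 <= layer < (1 << 16) and 0 <= chunk < (1 << 16)):
        raise ValueError("layer and chunk must each fit in 16 bits")
    return (layer << 16) | chunk


def decode_imm(imm):
    """Inverse of encode_imm."""
    return imm >> 16, imm & 0xFFFF


def chunk_offset(layer, chunk):
    """Byte offset of a chunk inside the receive buffer."""
    return (layer * CHUNKS_PER_LAYER + chunk) * CHUNK_BYTES


def simulated_write_with_imm(remote_buf, payload, layer, chunk):
    """Stand-in for RDMA WRITE WITH IMMEDIATE: copy bytes, return the immediate."""
    if len(payload) != CHUNK_BYTES:
        raise ValueError("payload must be exactly one chunk")
    off = chunk_offset(layer, chunk)
    remote_buf[off:off + CHUNK_BYTES] = payload
    return encode_imm(layer, chunk)


def tensor_view(recv_buf, imm):
    """Build a zero-copy 2-D float view over the chunk named by the immediate."""
    layer, chunk = decode_imm(imm)
    off = chunk_offset(layer, chunk)
    raw = memoryview(recv_buf)[off:off + CHUNK_BYTES]
    return layer, chunk, raw.cast("f", shape=[TOKENS_PER_CHUNK, HEAD_DIM])


# Demo: "send" one chunk and reconstruct its view on the receiving side.
recv = bytearray(NUM_LAYERS * CHUNKS_PER_LAYER * CHUNK_BYTES)
values = [float(i) for i in range(TOKENS_PER_CHUNK * HEAD_DIM)]
payload = struct.pack(f"{len(values)}f", *values)
imm = simulated_write_with_imm(recv, payload, layer=1, chunk=1)
layer, chunk, view = tensor_view(recv, imm)
```

The view aliases the receive buffer rather than copying it, which is the property a receiver would want when the payload has already been placed by the transport. In a real deployment the immediate value would arrive in a receive completion, not as a function return value.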

Abstract

AI transport libraries move bytes efficiently, but they commonly assume that buffers are already correctly allocated, placed, shared, registered, and safe under completion and teardown pressure. This paper presents dmaplane, a Linux kernel module that makes this missing layer explicit as buffer orchestration. dmaplane exposes a stable kernel UAPI via /dev/dmaplane and composes ring-based command channels, DMA buffer lifecycle management, dma-buf export for cross-device sharing, a kernel-space RDMA engine, NUMA-aware allocation and verification, credit-based flow control, low-overhead observability, and GPU memory integration via PCIe BAR pinning. We evaluate orchestration sensitivity with measurements of NUMA cross-node penalties at DRAM scale, completion-safe flow control under sustained RDMA load, and GPU BAR mapping tiers versus cudaMemcpy. We also demonstrate end-to-end disaggregated inference by transferring KV-cache chunks between two machines using RDMA WRITE WITH IMMEDIATE and reconstructing tensor views on the receiver. RDMA measurements use Soft-RoCE; we distinguish measured results from provider-independent properties by construction.
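The abstract names credit-based flow control but this summary does not describe its mechanism. As a rough illustration of the general technique only, the Python sketch below caps the number of in-flight operations with a fixed credit pool and returns a credit only when a completion has been reaped. The class name `CreditGate` and its methods are hypothetical and are not part of the dmaplane UAPI.

```python
from collections import deque


class CreditGate:
    """Caps outstanding work at a fixed number of credits (illustrative only)."""

    def __init__(self, credits):
        if credits <= 0:
            raise ValueError("credits must be positive")
        self.max_credits = credits
        self.credits = credits
        self.in_flight = deque()  # work ids posted but not yet completed

    def try_post(self, work_id):
        """Consume one credit and record the work item; refuse if none are left."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.in_flight.append(work_id)
        return True

    def complete(self):
        """Reap the oldest outstanding item and return its credit to the pool."""
        if not self.in_flight:
            raise RuntimeError("no outstanding work to complete")
        work_id = self.in_flight.popleft()
        self.credits += 1
        return work_id

    def can_teardown(self):
        """Buffers are safe to release only when nothing is in flight."""
        return not self.in_flight


# Demo: with two credits, the third post is refused until a completion arrives.
gate = CreditGate(2)
first = gate.try_post(0)
second = gate.try_post(1)
refused = gate.try_post(2)
reaped = gate.complete()
retried = gate.try_post(2)
```

Tying credit return to completion, not to posting, is what keeps the sender from overrunning a bounded completion or receive queue, and the `can_teardown` check reflects the same idea for buffer release.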

