Efficient, VRAM-Constrained xLM Inference on Clients
arXiv cs.LG · April 30, 2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The paper proposes “pipelined sharding,” a CPU-GPU hybrid scheduling method for running high-accuracy xLMs (LLMs and VLMs) on client devices with limited VRAM, without sacrificing accuracy.
- It combines sub-layer model sharding, CPU offloading, pipelined copy-compute (see the first sketch after this list), and prioritized VRAM tensor placement to improve both time-to-first-token (TTFT) and tokens per second (TPS), while adapting to varying system and inference conditions.
- For VLM workloads, it integrates pipelined sharding with a llama.cpp-based “VLMOpt” stack that uses vision tensor CPU offloading, flash attention, and VRAM overlap avoidance between the vision and language components (see the second sketch after this list).
- Evaluations targeting NVIDIA’s IGI SDK and Cosmos-Reason1 (CR1) report up to 6.7× faster TTFT and up to 30× higher TPS for interactive LLM inference, up to 10× lower VRAM demand for CR1, and up to 8.2× higher throughput in batched mode.
- The work is accepted for presentation at the 9th MLSys Conference (Industry Track) in 2026, with code and artifacts released on GitHub.
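The pipelined copy-compute idea from the second bullet can be sketched roughly as follows. This is a minimal illustration assuming a PyTorch-style runtime with CUDA streams; the shard structure, the `apply_shard` helper, and the stream layout are illustrative assumptions, not the paper's released implementation.

```python
import torch

def apply_shard(weights, x):
    # Placeholder sub-layer forward pass: a chain of linear projections.
    for w in weights:
        x = torch.relu(x @ w)
    return x

def run_pipelined(shards, x):
    """Run CPU-resident weight shards on the GPU, overlapping the
    host->device copy of shard k+1 with the compute of shard k."""
    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()
    x = x.to(device)
    # Stage the first shard on the current stream before the loop starts.
    staged = [w.to(device, non_blocking=True) for w in shards[0]]
    for k in range(len(shards)):
        next_staged = None
        if k + 1 < len(shards):
            # Enqueue the next shard's upload on a separate copy stream so it
            # proceeds while the current shard computes.
            with torch.cuda.stream(copy_stream):
                next_staged = [w.to(device, non_blocking=True)
                               for w in shards[k + 1]]
        # Compute on the current (default) stream using the staged weights.
        x = apply_shard(staged, x)
        # Make subsequent compute wait for the pending uploads to finish.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if next_staged is not None:
            staged = next_staged
    return x

if __name__ == "__main__":
    # Toy model: 4 shards of 2 pinned CPU weight matrices each.
    shards = [[torch.randn(1024, 1024).pin_memory() for _ in range(2)]
              for _ in range(4)]
    out = run_pipelined(shards, torch.randn(8, 1024))
    print(out.shape)
```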
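For the VLM case, the VRAM-overlap-avoidance point in the third bullet amounts to keeping the vision tower CPU-resident and shipping only its output embeddings to the GPU. The toy module sizes and names below (`vision_tower`, `language_blocks`, `vlm_forward`) are illustrative assumptions, not the paper's llama.cpp-based VLMOpt code.

```python
import torch
import torch.nn as nn

# Vision tower stays CPU-resident; only its output embeddings touch VRAM,
# so vision and language weights never occupy GPU memory at the same time.
vision_tower = nn.Sequential(nn.Linear(768, 1024), nn.GELU(),
                             nn.Linear(1024, 4096))            # CPU-resident
language_blocks = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(),
                                nn.Linear(4096, 4096)).cuda()  # GPU-resident

def vlm_forward(image_patches):
    # 1) Vision encoding runs on the CPU: no vision weights are ever
    #    allocated in VRAM.
    with torch.no_grad():
        image_embeds = vision_tower(image_patches)
    # 2) Only the comparatively small embedding tensor crosses PCIe.
    image_embeds = image_embeds.to("cuda", non_blocking=True)
    # 3) Language-side compute runs on the GPU as usual.
    with torch.no_grad():
        return language_blocks(image_embeds)

if __name__ == "__main__":
    out = vlm_forward(torch.randn(196, 768))  # e.g. 14x14 patch embeddings
    print(out.shape)
```

The trade-off is one embedding transfer per image in exchange for the vision weights never competing with the language model for VRAM.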