Efficient, VRAM-Constrained xLM Inference on Clients
arXiv cs.LG · April 30, 2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The paper proposes “pipelined sharding,” a CPU-GPU hybrid scheduling method for running high-accuracy xLMs (LLMs and VLMs) on client devices with limited VRAM, without sacrificing accuracy.
- It combines sub-layer model sharding, CPU offloading, pipelined copy-compute (see the first sketch after this list), and prioritized VRAM tensor placement to improve both time-to-first-token (TTFT) and tokens per second (TPS), while adapting to varying system and inference conditions.
- For VLM workloads, it integrates pipelined sharding with a llama.cpp-based “VLMOpt” stack that uses vision tensor CPU offloading, flash attention, and VRAM overlap avoidance between the vision and language components (see the second sketch after this list).
- Evaluations targeting NVIDIA’s IGI SDK and Cosmos-Reason1 (CR1) report up to 6.7× faster TTFT and up to 30× higher TPS for interactive LLM inference, up to 10× lower VRAM demand for CR1, and up to 8.2× higher throughput in batched mode.
- The work is accepted for presentation at the 9th MLSys Conference (Industry Track) in 2026, with code and artifacts released on GitHub.
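The pipelined copy-compute idea from the second bullet can be sketched roughly as follows. This is a minimal illustration assuming a PyTorch-style runtime with CUDA streams; the shard structure, the `apply_shard` helper, and the stream layout are illustrative assumptions, not the paper's released implementation.

```python
import torch

def apply_shard(weights, x):
    # Placeholder sub-layer forward pass: a chain of linear projections.
    for w in weights:
        x = torch.relu(x @ w)
    return x

def run_pipelined(shards, x):
    """Run CPU-resident weight shards on the GPU, overlapping the
    host->device copy of shard k+1 with the compute of shard k."""
    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()
    x = x.to(device)
    # Stage the first shard on the current stream before the loop starts.
    staged = [w.to(device, non_blocking=True) for w in shards[0]]
    for k in range(len(shards)):
        next_staged = None
        if k + 1 < len(shards):
            # Enqueue the next shard's upload on a separate copy stream so it
            # proceeds while the current shard computes.
            with torch.cuda.stream(copy_stream):
                next_staged = [w.to(device, non_blocking=True)
                               for w in shards[k + 1]]
        # Compute on the current (default) stream using the staged weights.
        x = apply_shard(staged, x)
        # Make subsequent compute wait for the pending uploads to finish.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if next_staged is not None:
            staged = next_staged
    return x

if __name__ == "__main__":
    # Toy model: 4 shards of 2 pinned CPU weight matrices each.
    shards = [[torch.randn(1024, 1024).pin_memory() for _ in range(2)]
              for _ in range(4)]
    out = run_pipelined(shards, torch.randn(8, 1024))
    print(out.shape)
```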
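For the VLM case, the VRAM-overlap-avoidance point in the third bullet amounts to keeping the vision tower CPU-resident and shipping only its output embeddings to the GPU. The toy module sizes and names below (`vision_tower`, `language_blocks`, `vlm_forward`) are illustrative assumptions, not the paper's llama.cpp-based VLMOpt code.

```python
import torch
import torch.nn as nn

# Vision tower stays CPU-resident; only its output embeddings touch VRAM,
# so vision and language weights never occupy GPU memory at the same time.
vision_tower = nn.Sequential(nn.Linear(768, 1024), nn.GELU(),
                             nn.Linear(1024, 4096))            # CPU-resident
language_blocks = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(),
                                nn.Linear(4096, 4096)).cuda()  # GPU-resident

def vlm_forward(image_patches):
    # 1) Vision encoding runs on the CPU: no vision weights are ever
    #    allocated in VRAM.
    with torch.no_grad():
        image_embeds = vision_tower(image_patches)
    # 2) Only the comparatively small embedding tensor crosses PCIe.
    image_embeds = image_embeds.to("cuda", non_blocking=True)
    # 3) Language-side compute runs on the GPU as usual.
    with torch.no_grad():
        return language_blocks(image_embeds)

if __name__ == "__main__":
    out = vlm_forward(torch.randn(196, 768))  # e.g. 14x14 patch embeddings
    print(out.shape)
```

The trade-off is one embedding transfer per image in exchange for the vision weights never competing with the language model for VRAM.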