AI Navigate

Dynamic expert caching PR in vLLM

Reddit r/LocalLLaMA / 3/17/2026


Key Points

  • The author describes a dynamic expert caching PR in vLLM that enables running a 16G Mixture-of-Experts model on 8G VRAM by caching a subset of experts in VRAM and the rest in RAM using an LRU policy.
  • Cache misses trigger CPU-based computation while experts are reshuffled, reducing latency during MoE inference.
  • The update discusses planned enhancements, including mxfp4 and other quantization forms (beyond fp8 and bf16), streaming from disk, a two-tier cache, and better EP/DP integration for RAM-limited machines.
  • The author invites others to try the feature, review the PR, and notes potential applicability to other projects beyond their own use of vLLM.

After all the talk about hurrying up and waiting for MoE expert offloading, I went "fine, I'll vibe it myself."
Tested, reviewed, polished, and tested again.

So now, I am running a 16G MoE model on 8G of VRAM.
This works by keeping a cache of a number of experts in VRAM and the rest in RAM.
The cache is LRU; when a cache miss occurs, compute takes place on the CPU while experts are being reshuffled, so latency is reduced.
Please do give it a whirl and review.
https://github.com/vllm-project/vllm/pull/37190
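To make the idea concrete, here is a minimal sketch of the scheme described above: an LRU set of "VRAM-resident" experts, with a CPU fallback on a miss while the missing expert gets promoted into the cache. All names (`ExpertLRUCache`, `gpu_compute`, `cpu_compute`, `run_expert`) are illustrative assumptions for this sketch, not vLLM's actual internals from the PR:

```python
from collections import OrderedDict

def cpu_compute(weights, x):
    # Stand-in for running an expert's FFN on the CPU.
    return weights * x

def gpu_compute(weights, x):
    # Stand-in for running an expert's FFN on the GPU.
    return weights * x

class ExpertLRUCache:
    """Keep up to `capacity` experts "in VRAM"; the rest live in RAM.

    On a hit, the expert runs on the fast device. On a miss, the expert
    runs on the CPU with the RAM-resident copy, and the expert is then
    promoted into the cache (evicting the least recently used one), so
    the forward pass is not blocked waiting for the weight transfer.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.vram = OrderedDict()  # expert_id -> weights, most recent last

    def run_expert(self, expert_id, x, ram_experts):
        if expert_id in self.vram:
            self.vram.move_to_end(expert_id)   # mark as most recently used
            return gpu_compute(self.vram[expert_id], x)
        # Miss: compute on the CPU so the token is not stalled...
        y = cpu_compute(ram_experts[expert_id], x)
        # ...then reshuffle: evict the LRU expert and promote this one.
        if len(self.vram) >= self.capacity:
            self.vram.popitem(last=False)
        self.vram[expert_id] = ram_experts[expert_id]
        return y
```

In a real implementation the promotion would be an asynchronous host-to-device copy overlapped with compute; here it is synchronous purely for clarity.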

The next PRs will add mxfp4 and other quantization formats (currently only fp8 and bf16 are supported), streaming from disk plus a two-tier cache for RAM-restricted machines, and a bunch of work on vLLM feature integration (EP/DP).
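For the planned two-tier cache with disk streaming, the lookup path could look roughly like the following hypothetical sketch (check VRAM first, then RAM, then stream from disk into the RAM tier). None of these names come from the PR; they are assumptions made up for illustration:

```python
def fetch_expert(expert_id, vram, ram, load_from_disk):
    """Two-tier lookup with disk as the backing store.

    vram: dict of the hottest experts (tier 1, fastest).
    ram:  dict of warm experts (tier 2).
    load_from_disk: callable streaming an expert's weights (tier 3).
    Returns the weights and the tier they were found in.
    """
    if expert_id in vram:                  # tier 1 hit
        return vram[expert_id], "vram"
    if expert_id in ram:                   # tier 2 hit
        return ram[expert_id], "ram"
    weights = load_from_disk(expert_id)    # tier 3: stream from disk
    ram[expert_id] = weights               # fill the RAM tier for next time
    return weights, "disk"
```

A real version would also promote RAM hits into VRAM and evict per-tier with its own LRU, as in the current PR's single-tier cache.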

Do let me know if these features would be appreciated in other projects; currently I use vLLM exclusively, so there was no need to look into them.

submitted by /u/king_of_jupyter