Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

Reddit r/LocalLLaMA / 4/15/2026


Key Points

  • The post introduces a “hot expert cache” approach in llama.cpp that dynamically keeps frequently routed MoE expert tensors in VRAM to reduce the CPU↔GPU transfer overhead for token generation.
  • Using Qwen3.5-122B-A10B on an RTX 4090 (24GB) plus a Ryzen 9 7950X (96GB), the author reports 22.67 tok/s with the hot expert cache versus 15.65 tok/s when all experts run on CPU (+44.8%) and versus 17.87 tok/s with layer-based partial offload (+26.8%).
  • Prompt processing performance is broadly comparable across configurations: the hot expert cache shows no regression versus the all-CPU baseline, and is only very slightly slower than layer-based offload at similar VRAM usage.
  • The method works by tracking expert routing over a recent window (controlled by rebalance-interval and bypass settings) and periodically re-selecting, "re-betting", which experts to keep hot in VRAM for the next segment of generation.
  • A repository link is provided for further details (code shared, binaries not yet available), positioning the change as an experimental optimization rather than a finalized release.

Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough. 23 tok/s is still rough but honestly noticeably faster when streaming responses.

TL;DR:

  • We keep track of which experts get routed to most frequently over the past N tokens. We make a bet that the processing speed-up from keeping these frequently routed-to experts in VRAM will outweigh the latency penalty of transferring expert tensors from system RAM (cold) into VRAM (hot). Rinse and repeat every N tokens.
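
The idea above can be sketched roughly like this. This is a hypothetical illustration, not the fork's actual code: `HotExpertCache`, its fields, and the halving decay are all my assumptions; the fork presumably does the real tensor uploads/evictions where `rebalance()` picks the top-K here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical sketch of a hot expert cache: count how often each expert is
// routed to, and every `rebalance_interval` tokens pick the top-K most
// frequently used experts to keep resident ("hot") in VRAM.
struct HotExpertCache {
    int n_experts;
    int hot_k;               // number of expert slots available in VRAM
    int rebalance_interval;  // re-bet every N tokens
    int tokens_seen = 0;
    std::vector<uint64_t> counts;  // routing frequency per expert
    std::vector<int> hot;          // expert ids currently "hot" in VRAM

    HotExpertCache(int n_experts, int hot_k, int interval)
        : n_experts(n_experts), hot_k(hot_k), rebalance_interval(interval),
          counts(n_experts, 0) {}

    // Record the experts the router selected for one token.
    // Returns true if this token triggered a rebalance.
    bool observe(const std::vector<int> & routed) {
        for (int e : routed) counts[e]++;
        if (++tokens_seen % rebalance_interval != 0) return false;
        rebalance();
        return true;
    }

    void rebalance() {
        std::vector<int> ids(n_experts);
        std::iota(ids.begin(), ids.end(), 0);
        // Partial sort: top hot_k expert ids by recent routing frequency.
        std::partial_sort(ids.begin(), ids.begin() + hot_k, ids.end(),
                          [&](int a, int b) { return counts[a] > counts[b]; });
        hot.assign(ids.begin(), ids.begin() + hot_k);
        // Decay counts so the window favors *recent* routing, not all history
        // (the decay scheme is my assumption).
        for (auto & c : counts) c /= 2;
    }

    bool is_hot(int expert) const {
        return std::find(hot.begin(), hot.end(), expert) != hot.end();
    }
};
```

In a real integration, the cost of uploading a newly hot expert is paid once per rebalance and amortized over the next N tokens of GPU-side compute, which is the bet the post describes.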

First off, results:

  • vs. all-CPU experts baseline:
    • +44.8% token generation (15.65 tok/s -> 22.67 tok/s)
    • no prompt processing regression
  • vs. layer-based offload at equivalent VRAM commitment:
    • +26.8% token generation (17.87 tok/s -> 22.67 tok/s)
    • very slightly slower prompt processing

Baseline: All experts offloaded to CPU (LLAMA_ARG_OVERRIDE_TENSOR=exps=CPU)

  • Prompt processing (tok/s, n=2928): 514.93, 534.64, 531.26
  • Token generation (tok/s, n=~300): 15.60, 15.67, 15.69

Partial Layer Offload (22.6 GB VRAM used): 8 layers loaded on GPU (LLAMA_ARG_N_CPU_MOE=40)

  • Prompt processing (tok/s, n=2929): 556.42, 581.73, 618.08
  • Token generation (tok/s, n=~300): 17.93, 17.81, 17.87

Hot expert cache (22.2 GB VRAM used): 44 expert slots in VRAM cache (LLAMA_ARG_MOE_HOT_K = 44, LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL=60, LLAMA_MOE_HOT_PP_BYPASS_N_TOKENS=64)

  • Prompt processing (tok/s, n=2929): 557.18, 542.76, 546.77
  • Token generation (tok/s, n=~300): 22.26, 22.97, 22.77
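
For reference, a launch line combining the settings above might look like the following. The `LLAMA_ARG_MOE_HOT_*` variables come from the linked fork and are not in upstream llama.cpp; the model filename is a placeholder:

```shell
# Hot-expert-cache settings (fork-specific, per the post above)
export LLAMA_ARG_MOE_HOT_K=44
export LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL=60
export LLAMA_MOE_HOT_PP_BYPASS_N_TOKENS=64

# Standard llama.cpp flags matching the benchmark setup
./llama-server \
  -m Qwen3.5-122B-A10B-Q4_K_L.gguf \
  --ctx-size 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --batch-size 3072 --ubatch-size 3072
```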

Setup:

  • RTX 4090 24GB + Ryzen 9 7950X 96GB
  • bartowski's Qwen3.5-122B-A10B Q4_K_L + bf16 vision mmproj
  • KV Cache 131K tokens @ Q8_0/Q8_0
  • For prompt processing, ubatch=3072 & batch=3072

Repo here with more details (code only for now, no binaries, still cooking): https://github.com/ParmesanParty/llama.cpp

submitted by /u/TriWrite