Will llama.cpp multislot improve speed?

Reddit r/LocalLLaMA / 4/26/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post discusses whether llama.cpp’s “multislot” mode (using --parallel > 1) can improve inference speed, and contrasts it with vLLM’s parallel performance.
  • The author reports that vLLM significantly boosts decode throughput (e.g., from ~150–170 tok/s on one llama.cpp slot to ~400 tok/s across four vLLM slots) when all slots/GPUs are fully utilized.
  • They argue that vLLM has practical limitations for some setups, particularly around CPU offload and compatibility/efficiency with GGUFs, leaving fewer quantization options (largely int4/int8) than llama.cpp.
  • Based on their long-running benchmarks (hours per run on 1 slot), the author wants to know whether using multiple slots with llama.cpp would shorten total benchmark time or yield similar runtimes due to bottlenecks.
  • Overall, the post frames the speed question as hardware- and configuration-dependent, emphasizing trade-offs between throughput gains and constraints like offload behavior and quantization support.

I've heard mostly bad opinions about running multiple slots with llama.cpp (--parallel > 1). Compared to vLLM it may well be worse at this, but I recently tried vLLM with 4 slots and it did improve overall speed significantly: roughly 150-170 tok/s decode on a single llama.cpp slot versus ~400 tok/s with 4-slot vLLM, provided all 4 slots are actually in use.
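For reference, this is roughly the setup I mean. llama-server takes a --parallel (or -np) flag for the slot count, and as I understand it the total -c context gets split evenly across the slots. A quick sanity-check sketch (port, model path, and sizes are placeholders, and the /props field names may differ between server versions):

```python
# Sanity check against a llama-server started with something like:
#   llama-server -m model.gguf -c 16384 --parallel 4
# (placeholders; as I understand it, the 16384-token context is split
# across the 4 slots, so each request sees roughly 4096 tokens)
import requests

# /props reports server-wide settings; exact field names may vary by version
props = requests.get("http://localhost:8080/props").json()
print("slots:", props.get("total_slots"))
print("context per slot:", props.get("default_generation_settings", {}).get("n_ctx"))
```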

BUT vLLM handles CPU offload poorly (or I don't know how to use it properly) and, from what I've heard, doesn't play well with GGUFs, which limits the available quantizations to basically int4/int8. For many models I can easily run a Q6 GGUF with llama.cpp at a decent speed, whereas with vLLM I'd have to step down to an int4 quant.
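For what it's worth, the offload knob I was fighting with is sketched below, using vLLM's offline API. This is just a minimal sketch under my assumptions: the model name is a placeholder, and cpu_offload_gb is the parameter I found for this, so take it with a grain of salt:

```python
# Hedged sketch of vLLM with CPU offload -- the setup I struggled with,
# not a recommendation. The model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/some-model-AWQ",  # placeholder; vLLM mostly wants int4/int8-style quants (AWQ/GPTQ)
    cpu_offload_gb=8,                 # push ~8 GiB of weights to CPU RAM; decode slows down noticeably
)
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```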

So, to the point: I've been running some benchmarks recently, and on single-slot llama.cpp they easily take a couple of hours or more per run. Would using multiple slots actually reduce the time to complete a benchmark, or would it stay roughly the same because of other bottlenecks?
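To make the question concrete: my understanding is that extra slots only cut wall-clock time if the benchmark client actually sends requests concurrently; a harness that fires one prompt at a time leaves the other slots idle. A minimal sketch of what I mean, against llama-server's OpenAI-compatible endpoint (the URL, worker count, and prompts are all placeholders):

```python
# Minimal concurrency sketch: send benchmark prompts in parallel so all
# llama-server slots stay busy. Endpoint and prompts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/completions"  # llama-server's OpenAI-compatible route
PROMPTS = [f"Question {i}: ..." for i in range(16)]  # stand-in for real benchmark items

def run_one(prompt: str) -> str:
    resp = requests.post(URL, json={"prompt": prompt, "max_tokens": 256}, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

start = time.time()
# max_workers should match --parallel; extra workers just queue on the server
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_one, PROMPTS))
print(f"{len(results)} completions in {time.time() - start:.1f}s")
```

Even if per-request decode slows down as the slots fill up, total throughput (and thus total benchmark wall time) could still come out ahead; that's basically what I'm asking about.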

submitted by /u/Real_Ebb_7417