Built a config sweep CLI for llama.cpp and vLLM and found out Q4_K_M beat Q8_0 by 230ms TTFT on Qwen2.5-7B

Reddit r/LocalLLaMA / 5/28/2026

Key Points

An open-source CLI called Sigilant-sweep benchmarks multiple llama.cpp and vLLM configurations by running 16 combinations of quantization, KV cache, and context size, then computing TPS/TTFT statistics (p50/p95) plus PPL on a fixed 3,300-token mixed corpus.
The author reports that for Qwen2.5-7B on a Modal L4 setup, Q4_K_M outperformed Q8_0 by about 230ms in TTFT while also improving overall weighted score, with TPS gains as well.
The tool addresses reproducibility by enforcing deterministic shuffling via cyclic offset, producing stable “winner” results about 9/10 times for the same hardware/backend.
The CLI includes confidence signaling: it explicitly flags low-confidence cases when the top-2 configuration gap is within noise, advising additional trials instead of trusting a single run.
It also offers a “depth profile” mode that tests TPS and TTFT at varying prompt lengths (8k/14k/28k) to show how the optimal configuration changes as context grows, while PPL is measured on the same fixed corpus.

I have been coming to this subreddit to figure out the optimal config for running a model on a given hardware setup. The benchmarks I found were too generic and did not account for the underlying hardware, so I decided to build the tool myself.

Sigilant-sweep is an OSS CLI that runs 16 configs (combinations of quantization, KV cache, and context size) for a specified number of trials. TPS and TTFT are measured in every trial, along with PPL on a fixed 3,300-token mixed-domain corpus. After all the trials, each config gets p50 and p95 values for TPS and TTFT. These are normalised and combined into a final score, which is a weighted average based on the profile you select (balanced, latency, or quality).
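To make the scoring step concrete, here is a minimal sketch of a percentile, normalise, then weighted-average pipeline. It is not the tool's actual code: the min-max normalisation, the `PROFILES` weights, and every function name are assumptions made for illustration, so it will not reproduce the scores reported in this post.

```python
def percentile(samples, pct):
    """Linearly interpolated percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def min_max(values, higher_is_better):
    """Scale values to 0..1 so that 1.0 is always the best config."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


# Hypothetical weights: the post names the profiles but not their weights.
PROFILES = {
    "balanced": {"tps": 0.34, "ttft": 0.33, "ppl": 0.33},
    "latency": {"tps": 0.3, "ttft": 0.6, "ppl": 0.1},
    "quality": {"tps": 0.2, "ttft": 0.2, "ppl": 0.6},
}


def score_configs(metrics, profile="balanced"):
    """metrics: {config: {"tps": .., "ttft": .., "ppl": ..}} -> {config: 0..100}."""
    weights = PROFILES[profile]
    names = list(metrics)
    total = {name: 0.0 for name in names}
    # TPS is better when higher; TTFT and PPL are better when lower.
    for key, higher_is_better in (("tps", True), ("ttft", False), ("ppl", False)):
        scaled = min_max([metrics[n][key] for n in names], higher_is_better)
        for name, s in zip(names, scaled):
            total[name] += weights[key] * s
    return {name: round(100 * t, 1) for name, t in total.items()}
```

Per-trial samples would first be reduced with `percentile(samples, 50)` or `percentile(samples, 95)`, and the resulting per-config values passed to `score_configs`. Under these assumed weights, a latency-heavy profile favours the config with the lower TTFT, while a quality-heavy profile favours the one with the lower PPL.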

The biggest challenge I faced was getting deterministic results. Initially, every run was showing a different winner. I tried multiple approaches and finally settled on deterministic shuffling through cyclic offset. This fixed the problem, and the results are now stable 9/10 times for a given hardware and backend.
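One way to read "deterministic shuffling through cyclic offset" is that the config order is rotated by the trial index instead of being drawn from a random generator, so every config visits every position in the run order and the sequence is identical across runs. The sketch below assumes that reading; `trial_order` and the config labels are hypothetical and not taken from the tool.

```python
def trial_order(configs, trial_index):
    """Rotate the config list left by trial_index positions (no randomness)."""
    if not configs:
        return []
    offset = trial_index % len(configs)
    return configs[offset:] + configs[:offset]


# Hypothetical labels, used only to show the rotation.
configs = ["Q4_K_M/ctx8192", "Q5_K_M/ctx8192", "Q8_0/ctx8192"]
for trial in range(4):
    print(trial, trial_order(configs, trial))
```

Because the order depends only on the trial index, warm-up and thermal effects are spread across configs in the same way on every run, which is one plausible reason the winner would become more stable.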

Results: Qwen2.5-7B (bartowski) · Modal L4 · 16 configs · 15 trials

| Config | TPS p95 | TTFT p95 | PPL | Score |
|---|---|---|---|---|
| Q4_K_M · ctx:8192 · kv:k16v16 (best) | 74.5 | 1856ms | 6.02 | 99 |
| Q4_K_M · ctx:16384 · kv:k16v16 | 74.3 | 1869ms | 6.02 | 98 |
| Q5_K_M · ctx:8192 · kv:k16v16 | 71.5 | 2010ms | 5.86 | 97 |
| Q5_K_M · ctx:16384 · kv:k16v16 | 71.0 | 1950ms | 5.86 | 97 |
| Q8_0 · ctx:8192 · kv:k16v16 | 63.8 | 2130ms | 5.82 | 92 |

Best vs Q8_0: TPS +10.7 · TTFT -274ms · PPL +0.20 · Score +7

Worth noting: the scores for Q4_K_M at ctx:8192 and ctx:16384 are within about 1% of each other. The CLI surfaces this explicitly and flags low confidence when the top-2 gap is within noise, so you know when to run more trials rather than blindly trusting a single winner.
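A low-confidence check of this kind could look like the sketch below. The post does not say how noise is estimated, so comparing the top-2 gap against the larger per-trial standard deviation is an assumption, as are the `confidence` function, the `noise_multiplier` parameter, and the sample scores in the usage note.

```python
from statistics import mean, stdev


def confidence(per_trial_scores, noise_multiplier=1.0):
    """Flag the winner as low confidence when the top-2 gap is within noise.

    per_trial_scores: {config: [score per trial, ...]} with at least
    two configs and at least two trials per config.
    """
    ranked = sorted(
        per_trial_scores,
        key=lambda cfg: mean(per_trial_scores[cfg]),
        reverse=True,
    )
    best, runner_up = ranked[0], ranked[1]
    gap = mean(per_trial_scores[best]) - mean(per_trial_scores[runner_up])
    # Assumed noise estimate: the larger trial-to-trial spread of the two.
    noise = max(stdev(per_trial_scores[best]), stdev(per_trial_scores[runner_up]))
    level = "low" if gap <= noise_multiplier * noise else "ok"
    return best, runner_up, level
```

With made-up per-trial scores such as `{"A": [99, 97, 101], "B": [98, 99, 97]}`, the gap between the means (1 point) is smaller than the spread of A's trials, so the winner is flagged `"low"` and more trials are the sensible next step.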

There is also a depth profile mode that tests TPS and TTFT at 8k, 14k, and 28k prompt lengths to show which config is optimal as context grows. Perplexity stays on the same fixed corpus across all passes.

What it measures: TPS, TTFT, ITL, PPL

What it does not measure: Full quality (tool calling, strict JSON validity, etc.). There is a 5-sample smoke test, but it's not used in scoring yet.

Backends: llama.cpp and vLLM
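For readers unfamiliar with the four metrics listed above, the sketch below shows one common way to derive them from per-token timestamps and log-probabilities. The definitions are assumptions (for example, TPS here counts decode tokens only and excludes the time to first token), and the function names are hypothetical; the tool may define them differently.

```python
import math


def latency_metrics(request_start, token_times):
    """Derive TTFT, ITL, and TPS from token arrival timestamps (seconds).

    token_times must be non-empty and increasing.
    """
    ttft = token_times[0] - request_start
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    # Decode throughput: tokens after the first, over the decode window.
    decode_time = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_ms": ttft * 1000, "itl_ms": itl * 1000, "tps": tps}


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Measuring PPL on one fixed corpus, as the tool does, keeps the quality number comparable across configs because every config is scored on exactly the same tokens.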

GitHub: https://github.com/sigilantlabs/sigilant-sweep/

Feedback welcome

submitted by /u/diptanshu1991