Built a config sweep CLI for llama.cpp and vLLM and found that Q4_K_M beat Q8_0 by 230ms TTFT on Qwen2.5-7B

Reddit r/LocalLLaMA / 2026/5/28


Key points

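Sigilant-sweep is an open-source CLI that runs 16 combinations of quantization, KV cache, and context length on llama.cpp and vLLM, and scores each configuration using TPS/TTFT statistics (p50/p95) together with PPL on a fixed 3,300-token mixed corpus.
The author reports that for Qwen2.5-7B on a Modal L4, Q4_K_M beat Q8_0 by about 230 ms TTFT, with improvements in TPS and in the overall (weighted) score as well.
To address reproducibility, the tool uses deterministic shuffling via a cyclic offset; the author says the "winner" is stable about 9 times out of 10 on the same hardware and backend.
The CLI also reports confidence: when the gap between the top two configs falls within noise, it flags the result as low confidence and prompts further trials instead of trusting a single run.
A "depth profile" mode measures TPS and TTFT at each prompt length (8k/14k/28k) to show how the best configuration shifts as context grows (PPL is measured on the same fixed corpus).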

Built a config sweep CLI for llama.cpp and vLLM and found out Q4_K_M beat Q8_0 by 230ms TTFT on Qwen2.5-7B

I have been coming to this subreddit to work out the optimal config for running a model on a given hardware setup. I looked at published benchmarks, but they are too generic and do not account for the underlying hardware. So I decided to build the tool myself.

Sigilant-sweep is an OSS CLI that runs 16 configs (combinations of quants, KV cache, and context size) for a specified number of trials. TPS and TTFT are measured on every trial, along with PPL on a fixed 3,300-token mixed-domain corpus. After all the trials, each config gets p50 and p95 values for TPS and TTFT. These are normalised and combined into a final score, which is a weighted average based on the profile you select (balanced, latency, or quality).
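To make the scoring step concrete, here is a minimal Python sketch of "percentiles, then normalise, then weighted average". It is not the tool's actual code: the min-max normalisation, the nearest-rank percentile, the `PROFILE_WEIGHTS` values, and every function name are assumptions for illustration, so the scores it produces will not match the Score column in the results.

```python
import math

# Hypothetical weights per profile; the real weights are not stated in the post.
PROFILE_WEIGHTS = {
    "balanced": {"tps": 1 / 3, "ttft": 1 / 3, "ppl": 1 / 3},
    "latency": {"tps": 0.3, "ttft": 0.6, "ppl": 0.1},
    "quality": {"tps": 0.2, "ttft": 0.2, "ppl": 0.6},
}


def percentile(samples, q):
    """Nearest-rank percentile (q in 0..100) over per-trial samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def min_max(values, higher_is_better):
    """Scale values to 0..1 so that 1.0 is always the best config."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def score_configs(stats, profile="balanced"):
    """stats maps config name -> {"tps": ..., "ttft": ..., "ppl": ...}."""
    names = list(stats)
    w = PROFILE_WEIGHTS[profile]
    tps = min_max([stats[n]["tps"] for n in names], higher_is_better=True)
    ttft = min_max([stats[n]["ttft"] for n in names], higher_is_better=False)
    ppl = min_max([stats[n]["ppl"] for n in names], higher_is_better=False)
    return {
        n: 100 * (w["tps"] * tps[i] + w["ttft"] * ttft[i] + w["ppl"] * ppl[i])
        for i, n in enumerate(names)
    }


# Three rows taken from the results in this post (TPS p95, TTFT p95 in ms, PPL).
table = {
    "Q4_K_M ctx:8192": {"tps": 74.5, "ttft": 1856, "ppl": 6.02},
    "Q5_K_M ctx:8192": {"tps": 71.5, "ttft": 2010, "ppl": 5.86},
    "Q8_0 ctx:8192": {"tps": 63.8, "ttft": 2130, "ppl": 5.82},
}
latency_scores = score_configs(table, profile="latency")
```

Changing `profile` only changes the weights, so the same measurements can produce a different winner under "quality" than under "latency".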

The biggest challenge I faced was getting deterministic results. Initially, every run was showing a different winner. I tried multiple approaches and finally settled on deterministic shuffling through cyclic offset. This fixed the problem, and the results are now stable 9/10 times for a given hardware and backend.
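One plausible reading of "deterministic shuffling through cyclic offset" is to rotate the config order by the trial index, so every config gets a turn at every position (first after warm-up, last when the GPU is hot, and so on) without any random seed. The sketch below assumes that reading; `trial_order` is a hypothetical name, not something taken from the repo.

```python
def trial_order(configs, trial_index):
    """Rotate the config list by trial_index so run order is fixed but varied."""
    n = len(configs)
    if n == 0:
        return []
    offset = trial_index % n
    return configs[offset:] + configs[:offset]


configs = ["Q4_K_M", "Q5_K_M", "Q8_0"]
# Trial 0 keeps the original order, trial 1 starts at Q5_K_M, trial 2 at Q8_0,
# and trial 3 wraps around to the original order again.
orders = [trial_order(configs, t) for t in range(4)]
```

Because the order depends only on the trial index, two sweeps on the same machine visit configs in exactly the same sequence, which removes ordering as a source of run-to-run differences.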

Results: Qwen2.5-7B (bartowski) · Modal L4 · 16 configs · 15 trials

| Config | TPS p95 | TTFT p95 | PPL | Score |
|---|---|---|---|---|
| Q4_K_M · ctx:8192 · kv:k16v16 (best) | 74.5 | 1856ms | 6.02 | 99 |
| Q4_K_M · ctx:16384 · kv:k16v16 | 74.3 | 1869ms | 6.02 | 98 |
| Q5_K_M · ctx:8192 · kv:k16v16 | 71.5 | 2010ms | 5.86 | 97 |
| Q5_K_M · ctx:16384 · kv:k16v16 | 71.0 | 1950ms | 5.86 | 97 |
| Q8_0 · ctx:8192 · kv:k16v16 | 63.8 | 2130ms | 5.82 | 92 |

Best vs Q8_0: TPS +10.7 · TTFT -274ms · PPL +0.20 · Score +7

Worth noting: Q4_K_M ctx:8192 and ctx:16384 score within 1% of each other. The CLI surfaces this explicitly and flags low confidence when the top-2 gap is within noise, so you know when to run more trials rather than blindly trusting a single winner.
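A rough sketch of how a "top-2 gap within noise" check could work: compare the gap between the two mean scores against the standard error of that gap across trials. This is an assumption about the approach, not the tool's implementation; the function name, the `k=2.0` threshold, and the per-trial numbers below are all made up for illustration.

```python
import math
import statistics


def confidence(top_scores, runner_up_scores, k=2.0):
    """Return "high" if the mean gap exceeds k standard errors, else "low".

    Each argument is a list of per-trial scores (at least two per config).
    """
    gap = statistics.mean(top_scores) - statistics.mean(runner_up_scores)
    # Standard error of the difference between two independent means.
    se = math.sqrt(
        statistics.variance(top_scores) / len(top_scores)
        + statistics.variance(runner_up_scores) / len(runner_up_scores)
    )
    return "high" if gap > k * se else "low"


# Illustrative per-trial scores only; these are not measurements from the post.
close_call = confidence([99, 98, 100, 99, 99], [98, 99, 97, 99, 98])
clear_win = confidence([99, 99, 100, 99, 99], [92, 93, 92, 91, 92])
```

When the result is "low", adding trials shrinks the standard error, which is why running more trials is the natural next step.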

There is also a depth profile mode that tests TPS and TTFT at 8k, 14k, and 28k prompt lengths to show which config is optimal as context grows. Perplexity stays on the same fixed corpus across all passes.

What it measures: TPS, TTFT, ITL, PPL

What it does not measure: Full quality (tool calling, strict JSON validity, etc.). There is a 5-sample smoke test, but it's not used in scoring yet.

Backends: llama.cpp and vLLM
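For anyone unfamiliar with the four metrics listed above, here is a small Python sketch of the usual definitions, computed from token arrival timestamps and per-token log-probabilities. The function names are hypothetical, and treating TPS as decode-only throughput (excluding the time to first token) is an assumption; the tool may define it differently.

```python
import math


def latency_metrics(request_time, token_times):
    """Compute TTFT, ITL and TPS from timestamps given in seconds.

    request_time: when the request was sent.
    token_times: arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_time  # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-token latency
    decode_time = token_times[-1] - token_times[0]
    # Decode-only tokens per second: tokens after the first over decode time.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft_s": ttft, "itl_s": itl, "tps": tps}


def perplexity(token_logprobs):
    """PPL = exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


m = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
ppl = perplexity([math.log(0.5)] * 4)
```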

Github: https://github.com/sigilantlabs/sigilant-sweep/

Feedback welcome

submitted by /u/diptanshu1991