I have been coming to this subreddit to understand what the optimal config is for running a model on a given hardware setup. I looked at specific benchmarks, but they are too generic and do not account for the underlying hardware. So I decided to build the tool myself.

Sigilant-sweep is an OSS CLI that runs 16 configs (combinations of quants, KV cache, and context size) for a specified number of trials. TPS and TTFT are measured every trial, along with PPL on a fixed 3,300-token mixed-domain corpus. After all the trials, each config gets p50 and p95 values for TPS and TTFT. These are normalised and combined into a final score, which is a weighted average based on the profile you select (balanced, latency, or quality).

The biggest challenge I faced was getting deterministic results. Initially, every run showed a different winner. I tried multiple approaches and finally settled on deterministic shuffling through a cyclic offset. This fixed the problem, and the results are now stable 9/10 times for a given hardware and backend.

Results: Qwen2.5-7B (bartowski) · Modal L4 · 16 configs · 15 trials

Worth noting: Q4_K_M ctx:8192 and ctx:16384 are within 1% of each other in score. The CLI surfaces this explicitly and flags low confidence when the top-2 gap is within noise, so you know when to run more trials rather than blindly trusting a single winner.

There is also a depth profile mode that tests TPS and TTFT at 8k, 14k, and 28k prompt lengths to show which config is optimal as context grows. Perplexity stays on the same fixed corpus across all passes.

What it measures: TPS, TTFT, ITL, PPL

What it does not measure: Full quality (tool calling, str JSON validity, etc.). There is a 5-sample smoke test, but it's not used in scoring yet.

Backends: llama.cpp and vLLM

Github: https://github.com/sigilantlabs/sigilant-sweep/

Feedback welcome
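To make the scoring idea concrete, here is a minimal sketch of how per-trial measurements could be reduced to percentiles, normalised, and combined into a weighted score with a low-confidence check on the top-2 gap. This is not taken from the Sigilant-sweep source: the function names, the profile weights, the use of p50 only, min-max normalisation, and the 0.01 noise threshold are all assumptions chosen for illustration.

```python
def percentile(values, q):
    """Percentile with linear interpolation between ranks (q in 0..100)."""
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def normalise(values, higher_is_better):
    """Min-max scale to [0, 1] so that 1.0 is always the best value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


# Hypothetical weights; the real profiles may weight metrics differently.
PROFILES = {
    "balanced": {"tps": 0.34, "ttft": 0.33, "ppl": 0.33},
    "latency": {"tps": 0.30, "ttft": 0.60, "ppl": 0.10},
    "quality": {"tps": 0.20, "ttft": 0.20, "ppl": 0.60},
}


def score_configs(trials, profile="balanced"):
    """trials maps config name -> {"tps": [...], "ttft": [...], "ppl": float}."""
    names = list(trials)
    w = PROFILES[profile]
    # Assumption: only p50 feeds the score here; p95 could be blended in too.
    tps = normalise([percentile(trials[n]["tps"], 50) for n in names], True)
    ttft = normalise([percentile(trials[n]["ttft"], 50) for n in names], False)
    ppl = normalise([trials[n]["ppl"] for n in names], False)
    return {
        n: w["tps"] * tps[i] + w["ttft"] * ttft[i] + w["ppl"] * ppl[i]
        for i, n in enumerate(names)
    }


def low_confidence(scores, noise=0.01):
    """True when the top-2 scores are closer than the assumed noise band."""
    top = sorted(scores.values(), reverse=True)
    return len(top) > 1 and (top[0] - top[1]) <= noise
```

With a structure like this, switching from `balanced` to `latency` only changes the weights, so a config that wins on TTFT but loses on PPL can move up the ranking without any re-measurement.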
Built a config sweep CLI for llama.cpp and vLLM and found out Q4_K_M beat Q8_0 by 230ms TTFT on Qwen2.5-7B
Reddit r/LocalLLaMA / 5/28/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage
Key Points
- An open-source CLI called Sigilant-sweep benchmarks multiple llama.cpp and vLLM configurations by running 16 combinations of quantization, KV cache, and context size, then computing TPS/TTFT statistics (p50/p95) plus PPL on a fixed 3,300-token mixed corpus.
- The author reports that for Qwen2.5-7B on a Modal L4 setup, Q4_K_M outperformed Q8_0 by about 230ms in TTFT while also improving overall weighted score, with TPS gains as well.
- The tool addresses reproducibility by enforcing deterministic shuffling via cyclic offset, producing stable “winner” results about 9/10 times for the same hardware/backend.
- The CLI includes confidence signaling: it explicitly flags low-confidence cases when the top-2 configuration gap is within noise, advising additional trials instead of trusting a single run.
- It also offers a “depth profile” mode that tests TPS and TTFT at varying prompt lengths (8k/14k/28k) to show how the optimal configuration changes as context grows, while PPL is measured on the same fixed corpus.
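The post does not describe how the cyclic-offset shuffling is implemented, so the following is only one plausible reading: rotate the config list by a trial-dependent offset instead of using a random shuffle. The function name `trial_order` and the `stride` parameter are hypothetical.

```python
def trial_order(configs, trial_index, stride=1):
    """Return configs rotated by a deterministic, trial-dependent offset.

    No random number generator is involved, so the same trial index always
    yields the same order, while different trials start at different configs.
    """
    if not configs:
        return []
    offset = (trial_index * stride) % len(configs)
    return configs[offset:] + configs[:offset]


# Example: four hypothetical config labels across three trials.
labels = ["Q4_K_M/8k", "Q4_K_M/16k", "Q8_0/8k", "Q8_0/16k"]
schedule = [trial_order(labels, t) for t in range(3)]
```

The appeal of a rotation over a seeded random shuffle is that, over a number of trials equal to the number of configs, every config runs once in every position, so warm-up or thermal effects tied to run order are spread evenly rather than left to chance.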