I wrote a PowerShell script to sweep llama.cpp MoE nCpuMoe vs batch settings
Reddit r/LocalLLaMA / 3/22/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage
| Hi all, I have been playing around with Qwen 3.5 MoE models and found that the sweet-spot tradeoff between nCpuMoe and batch size for speed isn't linear. I also kept rerunning the same tests across different quants, which got tedious. If there is a tool/script that does this already and I missed it, let me know (I didn't find any). How it works: the whole thing uses llama-bench under the hood, but does a binary sweep while respecting the VRAM constraint. If interested, you can find it here: https://github.com/DenysAshikhin/llama_moe_optimiser |
Key Points
- A Reddit post describes a PowerShell script that sweeps llama.cpp's MoE nCpuMoe setting against batch size to find the speed sweet spot under a VRAM constraint.
- It performs a binary-search-style sweep across MoE settings and batch sizes, benchmarking each run and tracking the best results per a chosen metric (e.g., time to finish, output quality, prompt processing).
- The workflow uses llama-bench under the hood and outputs a final top-5 table of runs, highlighting the non-linear relationship between batch size and MoE performance.
- The project is available on GitHub at DenysAshikhin/llama_moe_optimiser, and the author asks for feedback on whether such tools already exist.
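The post doesn't show the script's internals, but the described approach can be sketched in a few lines. The following Python sketch (hypothetical; the actual project is a PowerShell script, and the function names, VRAM model, and benchmark hook here are illustrative assumptions) binary-searches, for each batch size, the smallest nCpuMoe (fewest expert layers offloaded to CPU) that still fits the VRAM budget, benchmarks that configuration, and returns a top-5 table:

```python
def min_cpu_moe_that_fits(fits_vram, lo, hi):
    """Binary search the smallest n_cpu_moe in [lo, hi] with fits_vram(n) True.

    Assumes fits_vram is monotone: moving more MoE expert layers to CPU
    only reduces VRAM use. Returns None if even `hi` does not fit.
    """
    if not fits_vram(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if fits_vram(mid):
            hi = mid  # fits: try pushing more layers back onto the GPU
        else:
            lo = mid + 1  # too much VRAM: offload more layers to CPU
    return lo

def sweep(batch_sizes, max_moe, fits_vram, bench, top_n=5):
    """Benchmark each feasible (batch, n_cpu_moe) pair and return the
    top_n fastest runs by tokens/s. `bench` would wrap a llama-bench
    invocation in the real tool; here it is an injected callable."""
    results = []
    for batch in batch_sizes:
        n = min_cpu_moe_that_fits(lambda m: fits_vram(m, batch), 0, max_moe)
        if n is None:
            continue  # no n_cpu_moe setting fits at this batch size
        tps = bench(n, batch)
        results.append({"batch": batch, "n_cpu_moe": n, "tok_per_s": tps})
    return sorted(results, key=lambda r: r["tok_per_s"], reverse=True)[:top_n]
```

Injecting `fits_vram` and `bench` keeps the sweep logic testable without a GPU; the binary search reduces the number of expensive benchmark runs from linear in the nCpuMoe range to logarithmic per batch size, which matches the post's motivation of avoiding tedious rerunning.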