I wrote a PowerShell script to sweep llama.cpp MoE nCpuMoe vs batch settings
Reddit r/LocalLLaMA / 3/22/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage
| Hi all, I have been playing around with Qwen 3.5 MoE models and found that the sweet-spot tradeoff between nCpuMoe and batch size for speed isn't linear. I also kept rerunning the same tests across different quants, which got tedious. If there is a tool/script that does this already and I missed it, let me know (I didn't find any). How it works: the whole thing uses llama-bench under the hood, but does a binary sweep while respecting the VRAM constraint. If interested, you can find it here: https://github.com/DenysAshikhin/llama_moe_optimiser |
Key Points
- A Reddit post describes a PowerShell script that sweeps llama.cpp's MoE nCpuMoe setting against batch size to find the speed sweet spot under a VRAM constraint.
- It performs a binary-search-style sweep across MoE settings and batch sizes, benchmarking each run and tracking the best results per a chosen metric (e.g., time to finish, output quality, prompt processing).
- The workflow uses llama-bench under the hood and outputs a final top-5 table of runs, highlighting the non-linear relationship between batch size and MoE performance.
- The project is available on GitHub at DenysAshikhin/llama_moe_optimiser, and the author asks for feedback on whether such tools already exist.
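The post doesn't show the script's internals, but the described approach can be sketched in a few lines. The following Python sketch (hypothetical; the actual project is a PowerShell script, and the function names, VRAM model, and benchmark hook here are illustrative assumptions) binary-searches, for each batch size, the smallest nCpuMoe (fewest expert layers offloaded to CPU) that still fits the VRAM budget, benchmarks that configuration, and returns a top-5 table:

```python
def min_cpu_moe_that_fits(fits_vram, lo, hi):
    """Binary search the smallest n_cpu_moe in [lo, hi] with fits_vram(n) True.

    Assumes fits_vram is monotone: moving more MoE expert layers to CPU
    only reduces VRAM use. Returns None if even `hi` does not fit.
    """
    if not fits_vram(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if fits_vram(mid):
            hi = mid  # fits: try pushing more layers back onto the GPU
        else:
            lo = mid + 1  # too much VRAM: offload more layers to CPU
    return lo

def sweep(batch_sizes, max_moe, fits_vram, bench, top_n=5):
    """Benchmark each feasible (batch, n_cpu_moe) pair and return the
    top_n fastest runs by tokens/s. `bench` would wrap a llama-bench
    invocation in the real tool; here it is an injected callable."""
    results = []
    for batch in batch_sizes:
        n = min_cpu_moe_that_fits(lambda m: fits_vram(m, batch), 0, max_moe)
        if n is None:
            continue  # no n_cpu_moe setting fits at this batch size
        tps = bench(n, batch)
        results.append({"batch": batch, "n_cpu_moe": n, "tok_per_s": tps})
    return sorted(results, key=lambda r: r["tok_per_s"], reverse=True)[:top_n]
```

Injecting `fits_vram` and `bench` keeps the sweep logic testable without a GPU; the binary search reduces the number of expensive benchmark runs from linear in the nCpuMoe range to logarithmic per batch size, which matches the post's motivation of avoiding tedious rerunning.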