| About the Model: Evaluation approach taken: 1,264 samples total. Evaluation Results: Hardware: 32 vCPU, 125 GB RAM, no GPU. What This Means: Overall, these are solid results for a 3B-active MoE model running quantized on CPU. The entire evaluation was performed with Neo AI Engineer, which researched quant versions that could run on the available CPU system, applied the correct chat template, built the consolidated eval harness for the 3 benchmarks, and reported the final results after thorough review. |
Qwen 3.6 35B A3B Q4_K_M quant evaluation
Reddit r/LocalLLaMA / 4/18/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The post evaluates Qwen 3.6 35B with a 3B-active MoE (A3B) using a Q4_K_M quantized GGUF from Unsloth, running entirely on CPU via llama-cpp-python.
- Testing used three benchmarks—HumanEval (code generation), HellaSwag (commonsense reasoning), and BFCL (function calling)—with 1,264 total samples.
- Reported results were 47.56% on HumanEval, 74.30% on HellaSwag, and 46.00% on BFCL, indicating stronger performance on commonsense tasks than on code and function calling.
- On the hardware setup (32 vCPU, 125 GB RAM, no GPU), the quantized variant runs at about 22 tokens/second, described as a solid outcome for a 3B-active MoE model on CPU-only hardware.
- The evaluation was conducted with Neo AI Engineer to select compatible quantization versions for the available CPU and to build an end-to-end evaluation harness across the three benchmarks.
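The "consolidated eval harness" the post describes boils down to scoring each benchmark's per-sample outcomes and rolling them up into one report. A minimal sketch of that aggregation step is below; the benchmark names match the post, but the scoring logic and sample data here are hypothetical placeholders, not the author's actual Neo AI Engineer harness (which also handles model loading, chat templating, and per-benchmark prompting).

```python
# Hypothetical sketch of a consolidated eval aggregator: each benchmark
# contributes a list of per-sample pass/fail booleans, and the report
# shows per-benchmark accuracy plus the combined sample count.

def accuracy(results):
    """Percent of samples scored correct."""
    return 100.0 * sum(results) / len(results)

def report(benchmarks):
    """benchmarks: dict mapping benchmark name -> list[bool] of outcomes."""
    lines = []
    total = 0
    for name, results in benchmarks.items():
        total += len(results)
        lines.append(f"{name}: {accuracy(results):.2f}% ({len(results)} samples)")
    lines.append(f"Total samples: {total}")
    return "\n".join(lines)

# Toy outcomes for illustration only, not the post's real data:
demo = {
    "HumanEval": [True, False, True, False],
    "HellaSwag": [True, True, True, False],
    "BFCL": [True, False],
}
print(report(demo))
```

In the real run, the per-sample booleans would come from benchmark-specific checkers (unit-test execution for HumanEval, answer matching for HellaSwag, call-signature validation for BFCL), with the full set summing to the post's 1,264 samples.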