Qwen 3.6 35B A3B Q4_K_M quant evaluation

Reddit r/LocalLLaMA / 4/18/2026


Key Points

  • The post evaluates Qwen 3.6 35B with a 3B-active MoE (A3B) using a Q4_K_M quantized GGUF from Unsloth, running entirely on CPU via llama-cpp-python.
  • Testing used three benchmarks—HumanEval (code generation), HellaSwag (commonsense reasoning), and BFCL (function calling)—with 1,264 total samples.
  • Reported results were 47.56% on HumanEval, 74.30% on HellaSwag, and 46.00% on BFCL, indicating stronger performance on commonsense tasks than on code and function calling.
  • On the hardware setup (32 vCPU, 125GB RAM, no GPU), the quantized variant runs at about 22 tokens/second, described as a solid outcome for an active 3B MoE model at CPU scale.
  • The evaluation was conducted with Neo AI Engineer to select compatible quantization versions for the available CPU and to build an end-to-end evaluation harness across the three benchmarks.

About the Model:
35B total parameters, 3B active (A3B) mixture of experts architecture.

Evaluation approach taken:
We took the Q4_K_M quantized GGUF from Unsloth, ran it on CPU via llama-cpp-python, and tested it on three standard benchmarks:
- HumanEval (code generation),
- HellaSwag (commonsense reasoning), and
- BFCL (function calling).

1,264 samples total.
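A minimal sketch of the CPU-only setup described above, using the llama-cpp-python `Llama` API. The GGUF filename and generation parameters are assumptions for illustration, not the exact configuration used in the post; the block falls back gracefully if the library or model file is unavailable.

```python
# Sketch: load a Q4_K_M GGUF on CPU via llama-cpp-python.
# The model filename below is hypothetical.
try:
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3.6-35b-a3b-Q4_K_M.gguf",  # hypothetical filename
        n_ctx=4096,       # context window (assumed)
        n_threads=32,     # match the 32 vCPUs from the hardware setup
    )
    out = llm("def fizzbuzz(n):", max_tokens=256, temperature=0.0)
    completion = out["choices"][0]["text"]
except Exception:
    # llama-cpp-python not installed, or model file not present
    completion = None
```

With a setup like this, each benchmark sample is just a prompt passed through the same `llm(...)` call, which is what makes a single consolidated harness across the three benchmarks practical.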

Evaluation Results:
- HumanEval: 47.56% (78/164)
- HellaSwag: 74.30% (743/1000)
- BFCL: 46.00% (46/100)

Hardware:

32 vCPU, 125GB RAM. No GPU.

What This Means
The Q4_K_M quantized variant runs at 22 tokens/sec on CPU, delivering decent speed, and performs best on commonsense reasoning at 74%. Code generation and function calling are harder tasks for this variant, landing in the mid-40s.
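A tokens/sec figure like the 22 tok/s above can be measured by timing a generation call and dividing token count by elapsed time. The helper below is a generic sketch; `generate` stands in for whatever callable produces tokens (e.g. a wrapper around the llama-cpp-python model) and the dummy lambda is purely illustrative.

```python
import time

def tokens_per_second(generate, prompt, max_tokens):
    """Time a generation call and return throughput in tokens/sec.
    `generate` is any callable returning a sequence of tokens."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Dummy stand-in generator, just to exercise the helper:
rate = tokens_per_second(lambda p, n: ["tok"] * n, "hello", 100)
```

For a meaningful number, average over several prompts and exclude the prompt-processing (prefill) phase if your backend reports it separately.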

Overall these are solid results for an active 3B MoE model running quantized on CPU.

This entire evaluation was performed using Neo AI Engineer, which researched quantized versions that could run on the available CPU system, applied the correct chat template, built a consolidated eval harness for the three benchmarks, and reported the final results after thorough review.

submitted by /u/gvij