Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post compares Qwen 3.6 27B performance across BF16 and two GGUF quantized formats (Q4_K_M and Q8_0), evaluated with llama-cpp-python using Neo AI Engineer.
  • Across HumanEval, HellaSwag, and BFCL (function calling), BF16 achieves the best overall accuracy, but Q4_K_M delivers a close practical alternative with much lower resource usage.
  • Q4_K_M shows nearly identical BFCL scores to BF16 (63.0–63.25%) while reducing peak RAM from 54 GB (BF16) to 28 GB and shrinking the model file to 16.8 GB.
  • Q8_0 performs less favorably in this run: it is slower and uses more peak RAM than Q4_K_M, with lower HellaSwag results despite a slight improvement in HumanEval.
  • For local/CPU deployments, the piece recommends Q4_K_M by default unless the workload is heavily code-generation focused, while BF16 remains the choice for maximum quality.

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF variants with llama-cpp-python, using Neo AI Engineer to build and run the evals.
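For reference, here's a minimal sketch of how one of these variants can be loaded and queried with llama-cpp-python. The file path, thread count, and sampling settings are illustrative assumptions, not the exact harness Neo AI Engineer generated:

```python
# Minimal llama-cpp-python load/query sketch for one GGUF variant.
# model_path and n_threads are placeholders; adjust per variant and machine.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.6-27b-q4_k_m.gguf",  # hypothetical path; swap per variant
    n_ctx=32768,       # context window matching the eval setup below
    n_threads=16,      # tune to your CPU core count
    verbose=False,
)

out = llm.create_completion(
    "def fizzbuzz(n):",   # HumanEval-style code-completion prompt
    max_tokens=256,
    temperature=0.0,      # greedy decoding for reproducible benchmark scores
)
print(out["choices"][0]["text"])
```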

Benchmarks used:

  • HumanEval: code generation (scored pass/fail against each task's unit tests; see the sketch after this list)
  • HellaSwag: commonsense reasoning
  • BFCL: function calling
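
For context on scoring: HumanEval counts a sample as correct only if the generated completion passes the task's unit tests. A simplified sketch of that check follows; real harnesses run candidates in an isolated, time-limited subprocess, which this sketch does not:

```python
# Simplified HumanEval-style pass/fail check. Each task ships a prompt,
# a test suite defining check(...), and an entry_point function name.
# WARNING: exec'ing model output is unsafe; real evals sandbox this step.
def passes(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    program = prompt + completion + "\n" + test_code + f"\ncheck({entry_point})\n"
    try:
        exec(program, {"__name__": "__main__"})
        return True
    except Exception:
        return False

# Accuracy is then passed/total, e.g. 92/164 = 56.10% for BF16 below.
```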

Total samples:

  • HumanEval: 164
  • HellaSwag: 100
  • BFCL: 400

Results:

| Metric | BF16 | Q4_K_M | Q8_0 |
|---|---|---|---|
| HumanEval | 56.10% (92/164) | 50.61% (83/164) | 52.44% (86/164) |
| HellaSwag | 90.00% (90/100) | 86.00% (86/100) | 83.00% (83/100) |
| BFCL | 63.25% (253/400) | 63.00% (252/400) | 63.00% (252/400) |
| Avg accuracy | 69.78% | 66.54% | 66.15% |
| Throughput | 15.5 tok/s | 22.5 tok/s | 18.0 tok/s |
| Peak RAM | 54 GB | 28 GB | 42 GB |
| Model size | 53.8 GB | 16.8 GB | 28.6 GB |

What stood out:

Q4_K_M looks like the best practical variant here. It keeps BFCL almost identical to BF16, drops about 5.5 points on HumanEval, and is still only 4 points behind BF16 on HellaSwag.

The tradeoff is pretty good (math sanity-checked in the snippet after this list):

  • 1.45x faster than BF16
  • 48% less peak RAM
  • 68.8% smaller model file
  • nearly identical function calling score
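
The arithmetic behind those figures, derived straight from the table above:

```python
# Sanity-checking the Q4_K_M vs BF16 tradeoff figures from the results table.
bf16 = {"tok_s": 15.5, "ram_gb": 54, "size_gb": 53.8, "bfcl": 63.25}
q4km = {"tok_s": 22.5, "ram_gb": 28, "size_gb": 16.8, "bfcl": 63.00}

print(f"speedup:      {q4km['tok_s'] / bf16['tok_s']:.2f}x")        # 1.45x
print(f"peak RAM cut: {1 - q4km['ram_gb'] / bf16['ram_gb']:.0%}")   # 48%
print(f"file shrink:  {1 - q4km['size_gb'] / bf16['size_gb']:.1%}") # 68.8%
print(f"BFCL delta:   {bf16['bfcl'] - q4km['bfcl']:.2f} pts")       # 0.25
```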

Q8_0 was a bit underwhelming in this run. It improved HumanEval over Q4_K_M by ~1.8 points, but used 42 GB peak RAM vs 28 GB, ran slower (18.0 vs 22.5 tok/s), and scored lower on HellaSwag (83% vs 86%).

For local/CPU deployment, I would probably pick Q4_K_M unless the workload is heavily code-generation focused. For maximum quality, BF16 still wins.

Evaluation setup:

  • GGUF via llama-cpp-python
  • n_ctx: 32768
  • checkpointed evaluation (sketched after this list)
  • HumanEval, HellaSwag, and BFCL all completed
  • BFCL had 400 function calling samples
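
Roughly what "checkpointed evaluation" looks like in practice: per-sample results are persisted so an interrupted run resumes instead of restarting. This is a hedged sketch; the checkpoint file name and record schema are assumptions, not Neo AI Engineer's actual implementation:

```python
# Checkpointed eval loop sketch: flush results after every sample so a
# crashed/interrupted run resumes where it left off. Checkpoint path and
# record schema are illustrative assumptions.
import json
import os

CKPT = "eval_checkpoint.json"  # hypothetical checkpoint file

def run_eval(samples, generate, score):
    done = {}
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            done = json.load(f)  # resume: maps sample id -> pass/fail
    for s in samples:
        if s["id"] in done:
            continue  # already scored in a previous run
        output = generate(s["prompt"])    # e.g. llm.create_completion(...)
        done[s["id"]] = score(s, output)  # benchmark-specific pass/fail
        with open(CKPT, "w") as f:
            json.dump(done, f)            # checkpoint after each sample
    return sum(done.values()) / len(done)  # accuracy, e.g. 252/400 = 63.00%
```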

This evaluation was done using Neo AI Engineer, which built the GGUF eval setup, handled checkpointed runs, and consolidated the benchmark results. I manually reviewed the outcome as well.

Complete case study with benchmarking results, approach, and code snippets is in the comments below 👇

submitted by /u/gvij