
Qwen3.5-27b 8 bit vs 16 bit, 10 runs

Reddit r/LocalLLaMA / 3/19/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The Aider benchmark tested Qwen3.5-27b across four combinations of weights (bf16 vs fp8) and KV cache settings (bf16 vs fp8), with 10 runs per configuration to assess quantization impact on agentic coding.
  • Variance across runs was not statistically significant, suggesting the results are stable given the chosen experimental setup.
  • The benchmark involves 224 tasks; a typical run used 2,375,980 prompt tokens and 613,762 completion tokens, averaging about 13,300 tokens per task.
  • The author notes that fp8 quantization can be as good as bf16 in this setup, but fp8 cache may break down at longer context lengths and plans to explore 4-bit/5-bit configurations in the future.
Qwen3.5-27b 8 bit vs 16 bit, 10 runs

I ran the Aider benchmark on Qwen3.5-27b with the four combinations of model weights (bf16 or fp8) and KV cache (bf16 or fp8). Each benchmark was repeated 10 times. The observed variance is not statistically significant.

FAQ:

  • Why not do 100 runs? Each run takes 1+ hours and I have other projects. The variance is already small, and even if many more runs surfaced some tiny difference, it might not actually mean anything.
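The "is the variance significant?" question above can be sketched with a simple Welch's t-test over per-run scores. The pass rates below are made-up placeholders for illustration, not the author's actual numbers:

```python
# Hypothetical sketch: checking whether the difference between two
# configurations' per-run benchmark scores is statistically significant.
# The scores below are invented placeholders, NOT the author's data.
import math
import statistics

bf16_runs = [62.1, 61.5, 62.8, 61.9, 62.3, 61.7, 62.0, 62.5, 61.8, 62.2]  # hypothetical
fp8_runs = [61.9, 62.0, 61.6, 62.4, 61.8, 62.1, 61.5, 62.3, 62.0, 61.7]   # hypothetical

def welch_t(a, b):
    """Welch's t-statistic for two independent samples of possibly unequal variance."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

t = welch_t(bf16_runs, fp8_runs)
# At n=10 per group, |t| well below ~2 means the difference is
# indistinguishable from run-to-run noise.
print(f"t = {t:.3f}")
```

With 10 runs per configuration the test has limited power, which is exactly the author's point: a difference small enough to need 100 runs to detect is unlikely to matter in practice.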

  • Why the Aider benchmark? It sucks! Maybe - but I am researching for the specific purpose of agentic coding and I find the benchmark easy to use. The purpose is to find the impact of using a specific quantization, if any, not necessarily to judge the model on the actual numbers.

  • Can you test 4 bit, 5 bit etc? Yes, I am planning to.

  • What did you set the context to? I did not set the context. It is not my benchmark. I am just a user.

  • But I demand you tell me what the context is! Ok fine. The Aider benchmark is 224 tasks. On a typical run it used 2,375,980 prompt tokens and 613,762 completion tokens. That works out to an average of about 13,300 tokens per task.
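The per-task figure quoted above is just the total token count divided by the task count:

```python
# Verifying the per-task token arithmetic from the post.
prompt_tokens = 2_375_980
completion_tokens = 613_762
tasks = 224

avg = (prompt_tokens + completion_tokens) / tasks
print(f"{avg:.0f} tokens per task on average")  # prints "13347 tokens per task on average"
```

So the exact average is ~13,347 tokens per task, which the post rounds to 13,300.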

  • That is not enough context for a good test! It might be if your use case is Aider. But anyway, I have an idea for how I might be able to artificially increase the context by filling in some garbage in the system prompt. I am going to try that.
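The padding idea mentioned above could be sketched like this. The filler-generation approach and the ~4 characters/token estimate are my own illustration, not the author's implementation:

```python
# Sketch of artificially inflating context by prepending junk to the
# system prompt. The ~4 chars/token figure is a rough rule of thumb;
# real token counts depend on the model's tokenizer.
import random

def make_filler(target_tokens: int, seed: int = 0) -> str:
    """Build a deterministic block of junk words of roughly target_tokens."""
    rng = random.Random(seed)
    words = []
    chars = 0
    while chars < target_tokens * 4:  # ~4 chars/token heuristic
        word = "".join(rng.choice("abcdefghij") for _ in range(rng.randint(3, 8)))
        words.append(word)
        chars += len(word) + 1  # +1 for the joining space
    return " ".join(words)

# Aim for ~16k extra tokens of context before the real instructions.
padding = make_filler(16_000)
system_prompt = padding + "\n\nIgnore the text above. You are a coding assistant."
```

Whether random junk stresses the KV cache the same way real long-context content does is an open question, but it would at least exercise the fp8 cache at longer sequence lengths.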

  • You are an idiot for claiming fp8 is as good as bf16! I am claiming nothing. I am just sharing my findings. I know I am personally probably going to choose fp8 based on this, but you do you. Also, many people may be unable to run the full model but still be interested in knowing how much damage they suffer from using a quant.

  • This would be different if it were a knowledge-based test. Maybe - I am considering finding a different benchmark to find out if that is the case, although that is just because I am curious. My use case is agentic coding, so it wouldn't matter much to me.

  • fp8 cache breaks down at longer context lengths! That is a claim worth researching. I will work on it.

  • What was the test setup? vLLM in a Linux Podman container, running on an Nvidia RTX 6000 Pro Workstation GPU (600 W). The Aider benchmark ran in a separate Podman container.

submitted by /u/Baldur-Norddahl