I tested Qwen3.5-27B with vLLM, comparing the original bf16 weights against Qwen's own -fp8 quantization, and an 8-bit KV cache against the original 16-bit cache. I got practically identical results; I attribute the small difference to random noise, as I ran each configuration only once. The test used the Aider benchmark on an RTX 6000 Pro. My conclusion is that one should use fp8 for both weights and cache, which will dramatically increase the amount of context available.
Qwen3.5-27B 8-bit vs 16-bit
Reddit r/LocalLLaMA / 3/17/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage
Key Points
- The author compared Qwen3.5-27B with vLLM using the original bf16 version versus Qwen's -fp8 quantization, including an 8-bit KV cache versus the original 16-bit cache.
- Results were practically identical, with any small differences attributed to random noise since each run was performed only once.
- The test used the Aider benchmark on an RTX 6000 Pro.
- The conclusion is that fp8 should be used for both weights and cache to dramatically increase the amount of context available.
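The conclusion above can be sanity-checked with back-of-envelope arithmetic: halving the bytes per KV-cache element roughly doubles the number of tokens that fit in a fixed memory budget. (In vLLM, the cache dtype is controlled by `kv_cache_dtype="fp8"` on the `LLM` constructor or the `--kv-cache-dtype fp8` serve flag, while FP8 weight quantization is picked up from the checkpoint itself.) The dimensions below are hypothetical placeholders, not the actual Qwen3.5-27B architecture:

```python
# Rough KV-cache sizing: halving bytes per element ~doubles the context
# that fits in a fixed cache budget. All dimensions are illustrative
# placeholders, NOT the real Qwen3.5-27B config.
layers = 48
kv_heads = 8      # grouped-query attention KV heads (assumed)
head_dim = 128

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # 2 tensors (K and V) per layer, one vector per KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

bf16 = kv_bytes_per_token(2)   # 16-bit cache: bytes per token
fp8 = kv_bytes_per_token(1)    # 8-bit cache: bytes per token
budget = 20 * 1024**3          # e.g. 20 GiB of VRAM left for the KV cache

print(f"bf16: {bf16} B/token -> {budget // bf16} tokens fit")
print(f"fp8:  {fp8} B/token -> {budget // fp8} tokens fit")
```

With an fp8 cache, the same memory budget holds roughly twice as many tokens, which is where the extra context headroom comes from.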
Related Articles

I built an autonomous AI Courtroom using Llama 3.1 8B and CrewAI running 100% locally on my 5070 Ti. The agents debate each other through contextual collaboration.
Reddit r/LocalLLaMA
The Honest Guide to AI Writing Tools in 2026 (What Actually Works)
Dev.to
AI Cybersecurity
Dev.to
The Wave of Open-Source AI and Investment in Security: Trends from Qwen, MS, and Google
Dev.to