
Qwen3.5:35B-A3B on RTX 5090 32GB - KV cache quantization or lower weight quant to fit parallel requests?

Reddit r/LocalLLaMA / 3/20/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The setup uses an RTX 5090 with 32GB VRAM, a Qwen3.5 35B model (~27GB), embeddings around 0.955GB, a 32768-token context, and 2 parallel requests, which leads to VRAM being fully utilized and the second user hanging.
  • Option A proposes KV cache quantization: enable Flash Attention and set KV cache to Q8_0 while keeping weights at Q4_K_M, saving about 2–3GB with negligible quality loss.
  • Option B proposes lower weight quantization to Q3_K_M, saving 3–4GB but potentially noticeable quality degradation on technical/structured tasks.
  • Option C proposes reducing the context window to 24k or 16k tokens, which frees memory but may hinder processing of long documents.
  • The author is seeking practical recommendations and asks if anyone has production experience running Qwen3.5 35B with KV cache Q8_0.

Running a small company AI assistant (V&V/RAMS engineering) on Open WebUI + Ollama with this setup:

  • GPU: RTX 5090 32GB VRAM
  • Model: Qwen3.5:35b (Q4_K_M) ~27GB
  • Embedding: nomic-embed-text-v2-moe ~955MB
  • Context: 32768 tokens
  • OLLAMA_NUM_PARALLEL: 2

The model is used by 4-5 engineers simultaneously through Open WebUI.
The problem: nvidia-smi shows 31.4GB/32.6GB used, full with one request. With NUM_PARALLEL=2, when two users query at the same time, the second one hangs until the first completes. The parallelism is set but can't actually work because there's no VRAM left for a second context window.

I need to free 2-3GB. I see three options, and the internet is split on this:

Option A -> KV cache quantization: Enable Flash Attention + set KV cache to Q8_0. Model weights stay Q4_K_M. Should save ~2-3GB on context with negligible quality loss (0.004 perplexity increase according to some benchmarks).
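For what it's worth, Option A is just two environment variables in current Ollama builds (variable names per the Ollama FAQ; set them wherever your ollama server process gets its environment, e.g. a systemd override):

```shell
# Option A: quantize the KV cache while leaving weights at Q4_K_M.
export OLLAMA_FLASH_ATTENTION=1   # required before KV cache quantization applies
export OLLAMA_KV_CACHE_TYPE=q8_0  # f16 (default) -> q8_0 roughly halves cache size
export OLLAMA_NUM_PARALLEL=2
ollama serve
```

If Ollama runs as a systemd service, put the same variables in a service override and restart it rather than launching `ollama serve` by hand.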

Option B -> Lower weight quantization: Drop from Q4_K_M to Q3_K_M. Saves ~3-4GB on model size but some people report noticeable quality degradation, especially on technical/structured tasks.

Option C -> Reduce context window: Drop from 32k to 24k or 16k tokens and keep everything else. Frees memory, but it would be really tight, especially with long documents.
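Option C can be applied per model without touching the server config, via a Modelfile (the `qwen3.5-24k` tag below is just an example name):

```shell
# Option C: create a reduced-context variant of the model (24k instead of 32k).
cat > Modelfile <<'EOF'
FROM qwen3.5:35b
PARAMETER num_ctx 24576
EOF
ollama create qwen3.5-24k -f Modelfile
```

This also makes it easy to A/B the quality impact, since the original 32k model stays available under its old tag.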

For context: the model handles document analysis, calculations, normative lookups, and code generation. Accuracy on technical data matters a lot.

What would you do? Has anyone run Qwen3.5 35B with KV cache Q8_0 in production?

submitted by /u/DjsantiX