Running a small company AI assistant (V&V/RAMS engineering) on Open WebUI + Ollama with this setup:
- GPU: RTX 5090 32GB VRAM
- Model: Qwen3.5:35b (Q4_K_M) ~27GB
- Embedding: nomic-embed-text-v2-moe ~955MB
- Context: 32768 tokens
- OLLAMA_NUM_PARALLEL: 2
The model is used by 4-5 engineers simultaneously through Open WebUI.
The problem: nvidia-smi shows 31.4GB/32.6GB used, so the card is essentially full with a single request loaded. With NUM_PARALLEL=2, when two users query at the same time, the second one hangs until the first completes. Parallelism is configured, but it can't actually kick in because there's no VRAM left for a second context slot.
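For anyone wanting to sanity-check the numbers, here's a back-of-envelope KV cache calculation. The layer/head/dim values below are assumptions for illustration (I don't have the real specs for this model handy), but the structure of the math is standard for GQA models:

```python
# Back-of-envelope KV cache sizing. The layer/head numbers here are
# ASSUMPTIONS for illustration, not the actual specs of this model.
def kv_cache_bytes(ctx_tokens, n_layers=64, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # 2 bytes/elem = f16, Ollama's default
    # K and V each store n_kv_heads * head_dim values per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens

f16 = kv_cache_bytes(32768)                     # default f16 cache
q8  = kv_cache_bytes(32768, bytes_per_elem=1)   # roughly 1 byte/elem at q8_0
print(f"f16: {f16 / 2**30:.1f} GiB, q8_0: {q8 / 2**30:.1f} GiB, "
      f"saved: {(f16 - q8) / 2**30:.1f} GiB")
# with these assumed dims: f16: 8.0 GiB, q8_0: 4.0 GiB, saved: 4.0 GiB
```

Plug in the real layer/head counts and the savings land somewhere in the 2-4GB range people report, which is exactly the gap between "one request fits" and "two fit".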
I need to free 2-3GB. I see two options and the internet is split on this:
Option A -> KV cache quantization: Enable Flash Attention + set KV cache to Q8_0. Model weights stay Q4_K_M. Should save ~2-3GB on context with negligible quality loss (0.004 perplexity increase according to some benchmarks).
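For reference, Option A is just two server environment variables in Ollama (KV cache quantization requires flash attention to be on). On a systemd install it would look roughly like this:

```shell
# Option A: enable flash attention + q8_0 KV cache on the Ollama server.
# OLLAMA_KV_CACHE_TYPE only takes effect with flash attention enabled.
sudo systemctl edit ollama
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
#   Environment="OLLAMA_NUM_PARALLEL=2"
sudo systemctl restart ollama
```

Model weights stay untouched at Q4_K_M; only the cache is quantized.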
Option B -> Lower weight quantization: Drop from Q4_K_M to Q3_K_M. Saves ~3-4GB on model size but some people report noticeable quality degradation, especially on technical/structured tasks.
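Option B is just a pull of the lower-quant tag and repointing Open WebUI at it. The exact tag name below is a guess — check the registry or `ollama list` for the real one:

```shell
# Option B: pull the smaller quant (tag name is an ASSUMPTION,
# verify the actual Q3_K_M tag in the model registry first).
ollama pull qwen3.5:35b-q3_K_M
```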
Option C -> Reduce the context window from 32k to 24k or 16k and keep everything else, but that would be really tight, especially with long documents.
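If Option C wins out, the clean way to do it in Ollama is a derived model via a Modelfile rather than per-request overrides, so every Open WebUI user gets the same limit:

```shell
# Option C: create a derived model with a smaller context window.
cat > Modelfile <<'EOF'
FROM qwen3.5:35b
PARAMETER num_ctx 24576
EOF
ollama create qwen3.5-24k -f Modelfile
```

Then select `qwen3.5-24k` in Open WebUI instead of the base tag.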
For context: the model handles document analysis, calculations, normative lookups, and code generation. Accuracy on technical data matters a lot.
What would you do? Has anyone run Qwen3.5 35B with KV cache Q8_0 in production?