Running a small company AI assistant (V&V/RAMS engineering) on Open WebUI + Ollama with this setup:
- GPU: RTX 5090 32GB VRAM
- Model: Qwen3.5:35b (Q4_K_M) ~27GB
- Embedding: nomic-embed-text-v2-moe ~955MB
- Context: 32768 tokens
- OLLAMA_NUM_PARALLEL: 2
The model is used by 4-5 engineers simultaneously through Open WebUI.
The problem: nvidia-smi shows 31.4GB/32.6GB used, so the card is essentially full with a single request loaded. With NUM_PARALLEL=2, when two users query at the same time, the second one hangs until the first completes. Parallelism is configured, but it can't actually kick in because there's no VRAM left for a second context slot.
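For anyone wanting to sanity-check the numbers, here's a back-of-envelope KV cache calculation. The layer/head/dim values below are assumptions for illustration (I don't have the real specs for this model handy), but the structure of the math is standard for GQA models:

```python
# Back-of-envelope KV cache sizing. The layer/head numbers here are
# ASSUMPTIONS for illustration, not the actual specs of this model.
def kv_cache_bytes(ctx_tokens, n_layers=64, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # 2 bytes/elem = f16, Ollama's default
    # K and V each store n_kv_heads * head_dim values per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens

f16 = kv_cache_bytes(32768)                     # default f16 cache
q8  = kv_cache_bytes(32768, bytes_per_elem=1)   # roughly 1 byte/elem at q8_0
print(f"f16: {f16 / 2**30:.1f} GiB, q8_0: {q8 / 2**30:.1f} GiB, "
      f"saved: {(f16 - q8) / 2**30:.1f} GiB")
# with these assumed dims: f16: 8.0 GiB, q8_0: 4.0 GiB, saved: 4.0 GiB
```

Plug in the real layer/head counts and the savings land somewhere in the 2-4GB range people report, which is exactly the gap between "one request fits" and "two fit".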
I need to free 2-3GB. I see two options and the internet is split on this:
Option A -> KV cache quantization: Enable Flash Attention + set KV cache to Q8_0. Model weights stay Q4_K_M. Should save ~2-3GB on context with negligible quality loss (0.004 perplexity increase according to some benchmarks).
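For reference, Option A is just two server environment variables in Ollama (KV cache quantization requires flash attention to be on). On a systemd install it would look roughly like this:

```shell
# Option A: enable flash attention + q8_0 KV cache on the Ollama server.
# OLLAMA_KV_CACHE_TYPE only takes effect with flash attention enabled.
sudo systemctl edit ollama
# In the override file, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
#   Environment="OLLAMA_NUM_PARALLEL=2"
sudo systemctl restart ollama
```

Model weights stay untouched at Q4_K_M; only the cache is quantized.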
Option B -> Lower weight quantization: Drop from Q4_K_M to Q3_K_M. Saves ~3-4GB on model size but some people report noticeable quality degradation, especially on technical/structured tasks.
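Option B is just a pull of the lower-quant tag and repointing Open WebUI at it. The exact tag name below is a guess — check the registry or `ollama list` for the real one:

```shell
# Option B: pull the smaller quant (tag name is an ASSUMPTION,
# verify the actual Q3_K_M tag in the model registry first).
ollama pull qwen3.5:35b-q3_K_M
```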
Option C -> Reduce the context window from 32k to 24k or 16k and keep everything else, but that would be really tight, especially with long documents.
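If Option C wins out, the clean way to do it in Ollama is a derived model via a Modelfile rather than per-request overrides, so every Open WebUI user gets the same limit:

```shell
# Option C: create a derived model with a smaller context window.
cat > Modelfile <<'EOF'
FROM qwen3.5:35b
PARAMETER num_ctx 24576
EOF
ollama create qwen3.5-24k -f Modelfile
```

Then select `qwen3.5-24k` in Open WebUI instead of the base tag.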
For context: the model handles document analysis, calculations, normative lookups, and code generation. Accuracy on technical data matters a lot.
What would you do? Has anyone run Qwen3.5 35B with KV cache Q8_0 in production?