Our nonprofit association has an AI server with 2x RTX 3090, and I finally switched it over to vLLM to get better performance with multiple concurrent users.
Here's my docker compose file:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_API_KEY=my_very_secret_key_was_scrubbed
    volumes:
      - /opt/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host  # Prevents shared memory bottlenecks during tensor parallelism
    command: >
      --model cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
      --tensor-parallel-size 2
      --max-model-len 65536
      --gpu-memory-utilization 0.85
      --enable-prefix-caching
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-num-seqs 32
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    restart: unless-stopped
```

I'm super happy with it, but if you have suggestions for improvements, let me know!
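For anyone reproducing this: once the container is up, the server speaks the OpenAI-compatible API on port 8000. A minimal smoke test using only the standard library (the host, key, and `max_tokens` value here are placeholders, not part of my setup):

```python
import json
import urllib.request

# Placeholders -- substitute your own host and the key you set via VLLM_API_KEY.
BASE_URL = "http://localhost:8000/v1"
API_KEY = "my_very_secret_key_was_scrubbed"
MODEL = "cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit"

def build_request(prompt, max_tokens=64):
    """Assemble URL, headers, and JSON body for a chat completion call."""
    url = f"{BASE_URL}/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return url, headers, body

def chat(prompt):
    """Send the request and return the assistant's reply text."""
    url, headers, body = build_request(prompt)
    req = urllib.request.Request(url, data=json.dumps(body).encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` should return a short reply if the container came up cleanly; a 401 means the `Authorization` header doesn't match `VLLM_API_KEY`.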
Here are my llama-benchy results:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d2000 | 5463.38 ± 111.87 | | 748.82 ± 14.93 | 741.48 ± 14.93 | 748.93 ± 14.93 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d2000 | 103.13 ± 22.06 | 112.49 ± 24.41 | | | |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d32768 | 5178.25 ± 25.55 | | 6731.33 ± 33.06 | 6724.00 ± 33.06 | 6731.41 ± 33.05 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d32768 | 25.65 ± 1.43 | 27.93 ± 1.52 | | | |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d63000 | 4534.72 ± 42.10 | | 14353.15 ± 133.93 | 14345.82 ± 133.93 | 14353.26 ± 133.94 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d63000 | 12.85 ± 3.50 | 14.45 ± 3.21 | | | |
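A quick consistency check on the prefill rows (my own arithmetic, not llama-benchy output): the ~741 / 6724 / 14346 ms values line up with (context depth + 2048 prompt tokens) divided by the measured prefill t/s, which is what you'd expect if they are estimated prompt-processing times:

```python
# Cross-check: estimated prompt-processing time ~= tokens prefilled / prefill speed.
# Depths come from the test labels, speeds and ms values from the pp2048 rows above.
rows = [
    # (depth, prompt_tokens, prefill_tps, reported_ms)
    (2000,  2048, 5463.38, 741.48),
    (32768, 2048, 5178.25, 6724.00),
    (63000, 2048, 4534.72, 14345.82),
]

for depth, prompt, tps, reported_ms in rows:
    predicted_ms = (depth + prompt) / tps * 1000.0
    # All three depths agree to well within 1%.
    assert abs(predicted_ms - reported_ms) / reported_ms < 0.01
```

So the long-context slowdown here is entirely in decode speed (103 → 13 t/s); prefill throughput only drops about 17% from d2000 to d63000.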