Qwen 3.6 + vLLM + Docker + 2x RTX 3090 setup, working great!

Reddit r/LocalLLaMA / 4/19/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A nonprofit organization deployed a dual RTX 3090 AI server and switched to vLLM to improve throughput for multiple simultaneous users.
  • The post shares a working Docker Compose setup using the vLLM OpenAI-compatible image, mounting the Hugging Face cache and exposing port 8000.
  • It configures the Qwen 3.6 35B AWQ 4-bit model with tensor parallelism across both GPUs, long context support (max model length 65,536), prefix caching, and tool/coder parsing options.
  • Benchmarks with llama-benchy show prompt-processing throughput above ~4,500 t/s at every tested context depth (pp2048 at d2000/d32768/d63000), while token-generation speed falls sharply with depth, from ~103 t/s at d2000 to ~13 t/s at d63000.
  • The author expresses satisfaction with the results and invites suggestions for further improvements to the deployment and performance tuning.

Our nonprofit association has an AI server with 2x RTX 3090 and I finally switched over to vLLM to get better performance for multiple users.

Here's my docker compose file:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_API_KEY=my_very_secret_key_was_scrubbed
    volumes:
      - /opt/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host  # Prevents shared memory bottlenecks during tensor parallelism
    command: >
      --model cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
      --tensor-parallel-size 2
      --max-model-len 65536
      --gpu-memory-utilization 0.85
      --enable-prefix-caching
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-num-seqs 32
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    restart: unless-stopped
```
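Once the container is up, the server speaks the OpenAI-compatible API on port 8000. A minimal stdlib-only sketch of a chat request against this setup (the helper name and prompt are mine; the endpoint, model ID, and API-key placeholder come from the compose file above):

```python
import json
import urllib.request

# Values taken from the compose file above; the API key is the scrubbed placeholder.
BASE_URL = "http://localhost:8000/v1"
API_KEY = "my_very_secret_key_was_scrubbed"
MODEL = "cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit"

def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request (hypothetical helper)."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )

# To actually send it (requires the vllm container to be running):
#   with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```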

I'm super happy with it, but if you have suggestions for improvements, let me know!

Here are my llama-benchy results:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d2000 | 5463.38 ± 111.87 | | 748.82 ± 14.93 | 741.48 ± 14.93 | 748.93 ± 14.93 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d2000 | 103.13 ± 22.06 | 112.49 ± 24.41 | | | |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d32768 | 5178.25 ± 25.55 | | 6731.33 ± 33.06 | 6724.00 ± 33.06 | 6731.41 ± 33.05 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d32768 | 25.65 ± 1.43 | 27.93 ± 1.52 | | | |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d63000 | 4534.72 ± 42.10 | | 14353.15 ± 133.93 | 14345.82 ± 133.93 | 14353.26 ± 133.94 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d63000 | 12.85 ± 3.50 | 14.45 ± 3.21 | | | |
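The est_ppt column is consistent with the raw prompt-processing rate if the reported t/s is measured over all prompt tokens, i.e. the context depth plus the 2048-token pp batch. A back-of-envelope check (the helper name and the depth-plus-batch assumption are mine):

```python
def est_prefill_ms(depth_tokens: int, pp_tokens: int, tps: float) -> float:
    """Estimated prefill time in ms: (depth + pp) tokens at tps tokens/second."""
    return (depth_tokens + pp_tokens) / tps * 1000

# Reproduces the measured est_ppt column within the reported error bars:
print(est_prefill_ms(2000, 2048, 5463.38))   # ~741 ms  (measured 741.48 ± 14.93)
print(est_prefill_ms(32768, 2048, 5178.25))  # ~6724 ms (measured 6724.00 ± 33.06)
print(est_prefill_ms(63000, 2048, 4534.72))  # ~14344 ms (measured 14345.82 ± 133.93)
```

The small gap between est_ppt and ttfr/e2e_ttft (~7 ms) is roughly one generated token at the observed generation rates.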
submitted by /u/Zyj