Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer

Reddit r/LocalLLaMA / 5/2/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A native Windows setup is presented for running Qwen3.6-27B with vLLM (without WSL or Docker), packaged as a portable installer/launcher from an open-source GitHub project.
  • Reported performance on Windows 10 with an RTX 3090 includes 72 tok/s for short prompts, 64.5 tok/s for long prompts (~25k tokens), 53.4 tok/s at 127k context on a single GPU, and scaling to 160k context with PP=2 across two 3090 GPUs.
  • The project provides a patched vLLM fork for Windows and a launcher that first installs the bundled vLLM wheel into an embedded Python environment (one-time), then optionally auto-downloads the Lorbus AutoRound INT4 quant model from Hugging Face.
  • Usage is designed to be simple via a local OpenAI-compatible endpoint at http://127.0.0.1:5001/v1, after launching start.bat and selecting a snapshot.
  • Compatibility is stated for Ampere/Ada/Blackwell GPUs (e.g., 3090/4090/5090/A6000) and it is not expected to work on older Pascal/Turing/Arc or AMD GPUs.

The angle here is native Windows, no WSL. Simple installation, open source, no telemetry. Not selling or promoting anything: https://github.com/devnen/qwen3.6-windows-server

Numbers (RTX 3090, Windows 10):

  • 72 tok/s, short prompt
  • 64.5 tok/s, long prompt (~25k tokens)
  • 53.4 tok/s at 127k context (single GPU)
  • 160k context with PP=2 (2× RTX 3090)
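For anyone reproducing numbers like these, throughput is completion tokens divided by wall-clock generation time. A minimal stdlib-only sketch against the post's endpoint (the model name is a placeholder assumption; query GET /v1/models on the running server for the real id):

```python
import json
import time
import urllib.request

# Endpoint from the post; "local" model id is an assumption.
BASE_URL = "http://127.0.0.1:5001/v1"

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """tok/s = generated tokens / wall-clock generation time."""
    return completion_tokens / elapsed_s

def benchmark_once(prompt: str, model: str = "local",
                   max_tokens: int = 512) -> float:
    """Time one non-streaming completion and compute tok/s from the
    `usage` block an OpenAI-compatible server returns."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    return tokens_per_second(usage["completion_tokens"],
                             time.perf_counter() - start)
```

Note this measures end-to-end time including prefill, so long prompts will read lower than pure decode speed.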

Honestly, these aren't r/LocalLLaMA records. The community has hit 80–82 tok/s on a 3090 with TurboQuant 3-bit KV, and 160 tok/s on a 5090 on Linux. My launcher and patched vLLM close that gap on Windows.

Simple installation:

  1. Download qwen3.6-windows-server-portable-x64.zip from the Releases page.
  2. Unzip anywhere. No admin rights, no pip, no Python required.
  3. Double-click start.bat, pick a snapshot, hit Enter.
  4. Use the OpenAI-compatible endpoint at http://127.0.0.1:5001/v1.
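Once start.bat is up, any OpenAI-compatible client can talk to the server. A minimal stdlib-only sketch (the model id below is a placeholder, not something from the repo; list the real one via GET /v1/models):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5001/v1"  # endpoint from the post

def build_chat_payload(prompt: str, model: str = "local",
                       max_tokens: int = 256) -> dict:
    """Standard OpenAI chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The official `openai` Python package works the same way if you point its `base_url` at http://127.0.0.1:5001/v1.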

I had to build a patched vLLM fork for Windows to fix a few issues and make this work. I am including a portable launcher that ships the prebuilt wheel.

First run installs the bundled vLLM wheel + deps into the embedded Python (~5–15 min, one-time), then offers to auto-download the Lorbus AutoRound INT4 quant from HuggingFace if you don't already have it. Subsequent launches skip straight to the TUI.

Tested on Windows 10 + 2× RTX 3090 with the Lorbus AutoRound INT4 quant. Should work on any Ampere/Ada/Blackwell card (3090/4090/5090/A6000). Won't work on Pascal, Turing, Arc, or AMD.
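The compatibility line above amounts to a minimum CUDA compute-capability check: Pascal is SM 6.x, Turing 7.5, Ampere 8.0/8.6, Ada 8.9, Blackwell 10.x and up. A hedged sketch of that rule (the SM >= 8.0 threshold is my reading of the post, not a check from the repo):

```python
def vllm_windows_supported(major: int, minor: int) -> bool:
    """True for Ampere/Ada/Blackwell-class compute capabilities
    (SM >= 8.0); False for Pascal (6.x) and Turing (7.x)."""
    return (major, minor) >= (8, 0)

# With PyTorch installed, the running GPU's capability comes from
# torch.cuda.get_device_capability(), e.g. (8, 6) for an RTX 3090.
```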

I have a similar launcher and a patched vLLM for Linux with some very competitive numbers, but it is still a work in progress.

If you're on a 3090/4090/5090 on Windows, give it a spin and post your numbers.

Full details, patches, benchmarks, and config snapshots: https://github.com/devnen/qwen3.6-windows-server

submitted by /u/One_Slip1455