The angle here is native Windows, no WSL. Simple installation, open source, no telemetry. Not selling or promoting anything: https://github.com/devnen/qwen3.6-windows-server

Numbers (RTX 3090, Windows 10):

- 72 tok/s, short prompt
- 64.5 tok/s, long prompt (~25k tokens)
- 53.4 tok/s at 127k context (single GPU)
- 160k context with PP=2 (2× RTX 3090)

Honestly, these aren't r/LocalLLaMA records. The community has hit 80–82 tok/s on a 3090 with TurboQuant 3-bit KV, and 160 tok/s on a 5090 on Linux. My launcher and patched vLLM close that gap on Windows.

Simple installation:

1. Download the launcher from the GitHub repo.
2. Run start.bat and select a config snapshot.
3. Point any OpenAI-compatible client at http://127.0.0.1:5001/v1.

I had to build a patched vLLM fork for Windows to fix a few issues and make this work. I'm including a portable launcher that ships the prebuilt wheel. The first run installs the bundled vLLM wheel and its dependencies into the embedded Python (~5–15 min, one-time), then offers to auto-download the Lorbus AutoRound INT4 quant from Hugging Face if you don't already have it. Subsequent launches skip straight to the TUI.

Tested on Windows 10 + 2× RTX 3090 with the Lorbus AutoRound INT4 quant. It should work on any Ampere/Ada/Blackwell card (3090/4090/5090/A6000); it won't work on Pascal, Turing, Arc, or AMD.

I have a similar launcher and a patched vLLM for Linux with some very competitive numbers, but it is still a work in progress.

If you're on a 3090/4090/5090 on Windows, give it a spin and post your numbers. Full details, patches, benchmarks, and config snapshots: https://github.com/devnen/qwen3.6-windows-server
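If you want a number to post back, a minimal throughput check against the launcher's local endpoint is sketched below. Assumptions: the server is already running at http://127.0.0.1:5001/v1 (the default noted in the key points) and the `openai` Python package is installed; the model id is read from the server rather than hard-coded.

```python
# Minimal end-to-end tok/s check against the local OpenAI-compatible endpoint.
# Assumes the launcher's server is up at 127.0.0.1:5001.
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5001/v1", api_key="none")

# Ask the server which model it is actually serving.
model_id = client.models.list().data[0].id

start = time.perf_counter()
resp = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Write about 300 words on GPU memory bandwidth."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tok/s")
```

Note that this measures end-to-end throughput including prompt processing; the repo's own benchmarks may time generation only, so expect small differences.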
Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer
Reddit r/LocalLLaMA / 5/2/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- A native Windows setup is presented for running Qwen3.6-27B with vLLM (without WSL or Docker), packaged as a portable installer/launcher from an open-source GitHub project.
- Reported performance on Windows 10 with an RTX 3090: 72 tok/s on short prompts, 64.5 tok/s on long prompts (~25k tokens), and 53.4 tok/s at 127k context on a single GPU, scaling to a 160k context with pipeline parallelism (PP=2) across two 3090s (a launch sketch follows this list).
- The project provides a patched vLLM fork for Windows and a launcher whose first run installs the bundled vLLM wheel into an embedded Python environment (one-time), then optionally auto-downloads the Lorbus AutoRound INT4 quant model from Hugging Face (a download sketch follows this list).
- Usage is designed to be simple: launch start.bat, select a snapshot, and point any OpenAI-compatible client at the local endpoint http://127.0.0.1:5001/v1 (the throughput sketch near the top of this page targets the same endpoint).
- Compatibility is stated for Ampere/Ada/Blackwell GPUs (e.g., 3090/4090/5090/A6000); older Pascal/Turing cards, Intel Arc, and AMD GPUs are not expected to work (a quick capability check follows below).
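For the two-GPU, 160k-context configuration, the sketch below shows what an equivalent launch looks like with the stock `vllm serve` CLI. This is a hedged analogue, not the launcher's actual invocation: the model path is a placeholder, and the project's patched fork and config snapshots are the authoritative source.

```python
# Hedged analogue of the PP=2 / 160k-context setup using vLLM's standard
# "vllm serve" CLI. The model path is a placeholder; the actual project
# ships its own patched wheel and config snapshots.
import subprocess

subprocess.run([
    "vllm", "serve", "./models/qwen3.6-27b-autoround-int4",  # placeholder path
    "--pipeline-parallel-size", "2",   # split layers across the two 3090s (PP=2)
    "--max-model-len", "160000",       # the 160k context claimed for this setup
    "--port", "5001",                  # matches the launcher's default endpoint
])
```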
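The auto-download step presumably wraps a Hugging Face Hub snapshot fetch; a minimal sketch follows. The repo id is a made-up placeholder, since the post doesn't give the exact Hugging Face path for the Lorbus quant.

```python
# Sketch of the kind of fetch the launcher automates on first run.
# The repo_id below is hypothetical -- substitute the real Lorbus
# AutoRound INT4 repository name from Hugging Face.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Lorbus/Qwen3.6-27B-AutoRound-INT4",      # placeholder repo id
    local_dir="./models/qwen3.6-27b-autoround-int4",  # where the server loads from
)
print("Quant downloaded to:", local_path)
```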
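Since the compatibility floor is Ampere (CUDA compute capability 8.0), a quick PyTorch check tells you where your card stands before downloading anything:

```python
# Checks whether the local GPU meets the stated floor (Ampere, SM 8.0+).
# Pascal is SM 6.x and Turing is SM 7.x, both below the cutoff.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected (Arc/AMD are unsupported here).")

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
supported = (major, minor) >= (8, 0)  # Ampere 8.x, Ada 8.9, Blackwell 10.x+
print(f"{name}: SM {major}.{minor} -> {'supported' if supported else 'unsupported'}")
```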