I've been tuning my settings for a specific job that classifies markdown documents: lots of input tokens, very few output tokens, and no real prompt caching because every doc is different. So, these numbers are totally situational, but I thought I would share if anyone cares.
In the last 10 minutes it processed 1,214,072 input tokens to produce 815 output tokens and classified 320 documents (~2,000 input TPS).
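For anyone checking the math, the quoted rate lines up with the raw numbers; a quick back-of-the-envelope:

```python
# Sanity-check of the throughput figures quoted above.
input_tokens = 1_214_072   # processed over the 10-minute window
window_seconds = 10 * 60
docs = 320

print(f"{input_tokens / window_seconds:.0f} input tokens/sec")  # ~2023, i.e. the ~2000 TPS quoted
print(f"{input_tokens / docs:.0f} input tokens/doc")            # ~3794 tokens per document
print(f"{window_seconds / docs:.2f} sec/doc")                   # ~1.88 seconds per document
```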
I'm pretty blown away because the first iterations were much slower.
I tried a bunch of different quants and setups, but these numbers are from unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf running on the official llama.cpp:server-cuda13 image.
The key things I set to make it fast were:
- No vision/mmproj loaded. The mmproj is only needed for vision, and this use case doesn't require it.
- Ensuring "No thinking" is used
- Ensuring that it all fits in my free VRAM (including context during inference)
- Turning down the context size to 128k (see previous)
- Setting the parallelism to be equal to my batch size of 8
That gives each request in the batch 16k of context to work with; the under-1% of documents too large to fit get kicked out for special processing.
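The bullets above roughly translate into a server invocation like the sketch below. This is my reconstruction, not the author's literal command line: the image name, volume paths, and flag set are assumptions, the flag names are from recent llama.cpp builds, and `--reasoning-budget 0` is just one way to force "no thinking" (chat-template kwargs work too).

```shell
# Hypothetical llama.cpp server launch matching the settings described above.
# -ngl 99:             offload all layers so weights + KV cache stay in VRAM
# -c 131072:           128k total context...
# -np 8:               ...split across 8 parallel slots = 16k per request
# --reasoning-budget 0: disable thinking tokens (on builds that support it)
# No --mmproj is passed, so no vision projector gets loaded.
docker run --gpus all -p 8080:8080 -v /models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  -m /models/Qwen3.5-27B-UD-Q5_K_XL.gguf \
  -ngl 99 -c 131072 -np 8 --reasoning-budget 0
```

The per-slot context is just total context divided by parallel slots (131072 / 8 = 16384), which is where the 16k-per-request figure comes from.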
I haven't run the full set of evals yet, but a sample looks very good.