AI Navigate

2000 TPS with QWEN 3.5 27b on RTX-5090

Reddit r/LocalLLaMA / 3/14/2026


Key Points

  • The post reports achieving around 2000 TPS on a markdown document classification task using QWEN 3.5 27B (UD-Q5_K_XL.gguf) on an RTX-5090 with the official llama.cpp:server-cuda13 image.
  • In a 10-minute window, it processed 1,214,072 input tokens to produce 815 output tokens and classified 320 documents.
  • The speed gains came from disabling vision components, using a 'No thinking' mode, staying within VRAM, reducing the context size to 128k, and setting parallelism equal to the batch size of 8.
  • This configuration gives each batch item 16k of context, and the fewer than 1% of documents that exceed it are set aside for special processing.
  • The author notes the results are situational and not a full evaluation, but the initial sample looks very good.

I've been tuning my settings for a specific job that classifies markdown documents - lots of input tokens, no real prompt caching (every doc is different), and very few output tokens. So, these numbers are totally situational, but I thought I would share in case anyone cares.

In the last 10 minutes it processed 1,214,072 input tokens to produce 815 output tokens and classified 320 documents: roughly 2000 TPS.
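The headline figure follows directly from the reported numbers. A quick back-of-the-envelope check (the only assumption beyond the post's figures is treating the window as exactly 600 seconds):

```shell
# Throughput implied by the post: input tokens divided by the
# 10-minute (600 second) window. Bash arithmetic is integer-only.
input_tokens=1214072
window_seconds=600
echo $((input_tokens / window_seconds))   # prints 2023, i.e. ~2000 TPS
```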

I'm pretty blown away because the first iterations were much slower.

I tried a bunch of different quants and setups, but these numbers are unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf using the official llama.cpp:server-cuda13 image.

The key things I set to make it fast were:

  • No vision/mmproj loaded. The mmproj file is only needed for vision input, which this use case doesn't require.
  • Ensuring "No thinking" is used
  • Ensuring that it all fits in my free VRAM (including context during inference)
  • Turning down the context size to 128k (see previous)
  • Setting the parallelism to be equal to my batch size of 8
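Taken together, a launch command along these lines would express those settings. This is a sketch, not the author's actual command: the image tag, port, mount path, and model path are assumptions, and the flag names follow llama.cpp's server conventions. Disabling thinking on Qwen models is typically done per-request (e.g. via the chat template or an `enable_thinking`-style option), so it isn't shown as a server flag here.

```shell
# Hypothetical llama-server launch reflecting the settings above.
#   --no-mmproj : don't load a vision projector (text-only workload)
#   -c 131072   : 128k total context, sized to stay inside free VRAM
#   -np 8       : 8 parallel slots, matching the batch size of 8
#   -ngl 99     : offload all model layers to the GPU
docker run --gpus all -p 8080:8080 \
  -v "$PWD/models:/models" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.5-27B-UD-Q5_K_XL.gguf \
  --no-mmproj -c 131072 -np 8 -ngl 99
```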

That gives each request in the batch 16k of context to work with, and the fewer than 1% of documents that are too large get kicked out for special processing.
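The 16k-per-request figure is just the context window split evenly across the parallel slots, which is how llama.cpp allocates context when serving multiple slots:

```shell
# Each of the 8 parallel slots gets an equal share of the 128k window.
total_ctx=131072   # 128k total context
slots=8            # parallelism set equal to the batch size
echo $((total_ctx / slots))   # prints 16384, i.e. 16k per request
```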

I haven't run the full set of evals yet, but a sample looks very good.

submitted by /u/awitod