I've been tuning my settings for a specific job that classifies markdown documents: lots of input tokens, very few output tokens, and no real prompt caching because every doc is different. So, these numbers are totally situational, but I thought I would share if anyone cares.
In the last 10 minutes it processed 1,214,072 input tokens to produce 815 output tokens and classified 320 documents (~2,000 input TPS).
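For anyone checking the math, the quoted rate lines up with the raw numbers; a quick back-of-the-envelope:

```python
# Sanity-check of the throughput figures quoted above.
input_tokens = 1_214_072   # processed over the 10-minute window
window_seconds = 10 * 60
docs = 320

print(f"{input_tokens / window_seconds:.0f} input tokens/sec")  # ~2023, i.e. the ~2000 TPS quoted
print(f"{input_tokens / docs:.0f} input tokens/doc")            # ~3794 tokens per document
print(f"{window_seconds / docs:.2f} sec/doc")                   # ~1.88 seconds per document
```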
I'm pretty blown away because the first iterations were much slower.
I tried a bunch of different quants and setups, but these numbers are from unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf running on the official llama.cpp:server-cuda13 image.
The key things I set to make it fast were:
- No vision/mmproj loaded. The mmproj is only needed for vision, and this use case doesn't require it.
- Ensuring "No thinking" is used
- Ensuring that it all fits in my free VRAM (including context during inference)
- Turning down the context size to 128k (see previous)
- Setting the parallelism to be equal to my batch size of 8
That gives each request in the batch 16k of context to work with; the under-1% of documents too large to fit get kicked out for special processing.
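The bullets above roughly translate into a server invocation like the sketch below. This is my reconstruction, not the author's literal command line: the image name, volume paths, and flag set are assumptions, the flag names are from recent llama.cpp builds, and `--reasoning-budget 0` is just one way to force "no thinking" (chat-template kwargs work too).

```shell
# Hypothetical llama.cpp server launch matching the settings described above.
# -ngl 99:             offload all layers so weights + KV cache stay in VRAM
# -c 131072:           128k total context...
# -np 8:               ...split across 8 parallel slots = 16k per request
# --reasoning-budget 0: disable thinking tokens (on builds that support it)
# No --mmproj is passed, so no vision projector gets loaded.
docker run --gpus all -p 8080:8080 -v /models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda13 \
  -m /models/Qwen3.5-27B-UD-Q5_K_XL.gguf \
  -ngl 99 -c 131072 -np 8 --reasoning-budget 0
```

The per-slot context is just total context divided by parallel slots (131072 / 8 = 16384), which is where the 16k-per-request figure comes from.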
I haven't run the full set of evals yet, but a sample looks very good.