What's your tps on 3090 + Qwen 3.6 27B in real tasks?

Reddit r/LocalLLaMA / 5/2/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author is trying to run a local agent for low-complexity coding/audit tasks using Qwen 3.6 27B on an RTX 3090 with very large context windows (e.g., 200k), but observes very low throughput deep in the context (around 10–11 tokens per second).
  • They report experimentation with various stacks (Tom’s turboquant/llama.cpp forks on Windows, WSL2 + vLLM with MTP/Genesis patches, and Luce DFlash), but run into issues such as OOM at practical context sizes, tool-related problems, and ineffective or broken implementations (including formatting/thinking behavior).
  • The author questions why community posts claim much higher TPS (e.g., 85–100 tps or ~30 tps) and asks whether those numbers are misleading due to being based on single-prompt benchmarks rather than multi-step, agentic back-and-forth.
  • They also ask whether acceleration techniques like MTP and DFlash inherently degrade at longer contexts, because prediction becomes harder when the draft model only sees part of the context.
  • They conclude by seeking clarification on whether the problem is primarily a “skill issue”/configuration problem or a more fundamental limitation of these approaches for real long-context coding agents.

I struggle to wrap my head around all this. My goal is a local agent that solves low-complexity tasks in the same harness where I'd use frontier models. Naturally that means a large context window, because "low complexity" can mean a simple-ish fix in a large codebase rather than just generating some nonsense from zero.

So initially I went with Tom's turboquant plus a fork of llama.cpp (I'm on Windows), running Qwen 3.6 Q4 and IQ4 models with a 200k context window. Well, it worked: it can read the entire example project I gave it and produce an audit (as far as it's capable of one). But deep into the context window the speed is just sad, like 10-11 tps, or even lower.
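
Just to sanity-check that slowdown for myself, here's the back-of-envelope I keep doing. Pure memory-bandwidth ceiling, and the layer/head numbers are guesses since I don't know the exact Qwen 3.6 config:

```python
# Rough decode-speed ceiling from memory bandwidth alone: every generated token has to
# re-read the weights plus the whole KV cache. Ignores compute, kernel efficiency, OS overhead.
BANDWIDTH = 936e9                         # bytes/s, RTX 3090 spec
WEIGHT_BYTES = 15e9                       # ~27B params at roughly 4.5 bits/param (rough guess)
KV_BYTES_PER_POS = 2 * 48 * 8 * 128 * 1   # q8 K+V per position; 48 layers / 8 KV heads /
                                          # head_dim 128 are placeholders, not the real config

for ctx in (4_000, 64_000, 150_000):
    per_token_bytes = WEIGHT_BYTES + ctx * KV_BYTES_PER_POS
    print(f"{ctx // 1000:>4}k tokens deep: ceiling ~{BANDWIDTH / per_token_bytes:.0f} tps")
```

Real numbers land well below that ceiling, but the trend is the point: the deeper you are, the more cache every single token has to drag through, so 10-11 tps way into the context doesn't look crazy to me.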

So I went down the rabbit hole of all the posts saying they get 85-100 tps on a single 3090 with a 5-billion-token context window or so. I've tried WSL2 + vLLM with the MTP and Genesis patches. Well, it works in the sense that it launches, but I OOM at any adequate context window, and there also seem to be tool issues and whatnot.
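
The OOM part at least adds up when I do the math. Again, the layout numbers are placeholders because I don't know the real architecture:

```python
# Back-of-envelope KV cache size for a ~27B GQA model.
# 48 layers / 8 KV heads / head_dim 128 are illustrative placeholders, not the real Qwen config.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

def kv_gib(ctx_tokens: int, bytes_per_elem: float) -> float:
    """GiB of K+V cache at a given context length and element size (2 = fp16, 1 = q8, 0.5 = q4-ish)."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem   # K and V, summed over layers
    return per_token * ctx_tokens / 2**30

for name, b in (("fp16", 2), ("q8", 1), ("q4-ish", 0.5)):
    print(name, " | ".join(f"{kv_gib(c, b):.1f} GiB @ {c // 1000}k" for c in (32_000, 128_000, 200_000)))
```

With the Q4 weights already taking something like 15 of the 24 GB, an fp16 cache at 200k has no chance, and since vLLM reserves its KV blocks up front it just refuses to start, while the turboquant-style forks limp along by quantizing the cache hard.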

I've tried the Luce DFlash solution, and it turned out they didn't even have a working server. I made two PRs that fixed huge VRAM issues, but then it turned out it doesn't format thinking correctly and can't use tools whatsoever. Oh well. It was fast in the "hi" chat at least.

Now I'm trying some other llama.cpp forks and modifying them to fix obvious issues they have, but at this point I have to question it all.

What's your tps on a 3090 + Qwen 3.6 27B in real tasks? Like real coding tasks with many thousands of tokens of context, in a proper harness? From what I read, all these techniques like MTP and DFlash degrade very fast as context grows, because correct prediction gets much harder when the draft model only sees a small part of the context at any time. Is that right?
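
My mental model for why those speedups evaporate. Toy math that ignores the draft model's own cost, so treat it as a sketch, not a measurement:

```python
def tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step when the draft proposes k tokens and
    each is accepted independently with probability alpha:
    1 + alpha + alpha**2 + ... + alpha**k = (1 - alpha**(k + 1)) / (1 - alpha)."""
    return k + 1 if alpha >= 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)

BASE_TPS = 12   # plain decode speed deep in context -- my own ballpark, not a benchmark
K = 4           # draft tokens proposed per step
for alpha in (0.85, 0.60, 0.30):   # acceptance rates are made up: clean short prompt vs. deep context
    boost = tokens_per_step(alpha, K)
    print(f"acceptance {alpha:.2f}: ~{boost:.2f}x -> ~{BASE_TPS * boost:.0f} tps")
```

If acceptance drops from ~0.85 on a clean short prompt to ~0.3 once the draft can't see most of a 150k-token codebase, the headline 3-4x collapses to barely above baseline, which would line up with the "degrades fast with context" behavior.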

But I also see people claiming they maintain something like 30 tps in long chats. "Chats" is the key word there: all these benchmark numbers come from feeding the model a single prompt, which is so, so much faster than a multi-step chat. But in real agentic usage you often need that back-and-forth feedback.
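
For anyone answering, this is roughly how I'd want the number measured: stream a turn mid-conversation from whatever OpenAI-compatible server you run (llama.cpp server, vLLM, whatever) and time the decode with the whole prior exchange in context, not a fresh single prompt. Quick sketch; the URL/port, the file name, and the "one SSE chunk ≈ one token" assumption are mine:

```python
import json
import time

import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"   # adjust to wherever your server listens

def timed_turn(messages, max_tokens=512):
    """Stream one assistant turn, print time-to-first-token and decode speed, return the text."""
    payload = {"messages": messages, "max_tokens": max_tokens, "stream": True}
    t0, first, chunks, text = time.time(), None, 0, []
    with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            if not chunk.get("choices"):
                continue
            delta = chunk["choices"][0].get("delta", {}).get("content")
            if delta:
                first = first or time.time()
                chunks += 1
                text.append(delta)
    decode_s = max(time.time() - (first or t0), 1e-9)
    print(f"prefill wait {((first or time.time()) - t0):.1f}s, decode ~{chunks / decode_s:.1f} tok/s")
    return "".join(text)

# Multi-step usage: keep appending turns so the timing includes the whole prior conversation.
history = [{"role": "user", "content": open("big_module.py").read() + "\n\nAudit this file."}]
history += [{"role": "assistant", "content": timed_turn(history)},
            {"role": "user", "content": "Now fix the worst issue you found."}]
timed_turn(history)
```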

And yes, I do need thinking; it's crucial for coding tasks, but it seems like it hurts the speed of these prediction systems even further?

So tell me: is it a skill issue, or is it really not as simple as these posts make it seem?

submitted by /u/Anbeeld