Following up on our previous post about running Qwen3.6-27B on a single RTX 3090 (~125K context, higher TPS).
We’ve been pushing further on both context length and stability for tool-agent workloads.
Current results:
- ~218K context @ ~50 / ~66 TPS (text-only; narrative / code workloads)
- ~198K context + vision @ ~51 / ~68 TPS
- tool calls with ~25K-token outputs now complete without OOM
So: lower TPS than our earlier config, but significantly higher context and better stability under real workloads.
---
### What changed
Previously, long tool outputs (~25K tokens) would consistently OOM during prefill.
This turned out to be related to a Genesis patch (PN12) that was supposed to mitigate a memory issue but wasn't actually being applied on vLLM dev205+:
- `apply_all` reported success
- but the underlying code path was unchanged
Root cause: anchor drift in the patch. The upstream code the patch anchored on had changed in dev205+, so the text replacement silently matched nothing.
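To make the failure mode concrete, here's a minimal sketch of how an anchor-based patcher can report success while changing nothing. Names, anchors, and the checked variant are all made up for illustration; this is not the Genesis code:

```python
# Hypothetical sketch (not the actual Genesis patcher API): an anchor-based
# text patch that reports success even when its anchor no longer matches.

def apply_patch(src: str, anchor: str, replacement: str) -> tuple[str, bool]:
    # Bug: str.replace() is a silent no-op when the anchor is absent,
    # so the patch "applies" but the underlying code path is unchanged.
    return src.replace(anchor, replacement), True

def apply_patch_checked(src: str, anchor: str, replacement: str) -> tuple[str, bool]:
    # Fix: treat a missing anchor as a hard failure instead of success.
    if anchor not in src:
        return src, False
    return src.replace(anchor, replacement), True

# The upstream line drifted, so the stale anchor no longer matches:
upstream = "kv = alloc_kv_cache(seq_len)\n"          # current upstream code
patch = ("kv = allocate_kv_cache(seq_len)",          # stale anchor
         "kv = allocate_kv_cache(min(seq_len, cap))")
patched, ok = apply_patch(upstream, *patch)
assert ok and patched == upstream   # reported success, changed nothing
checked, ok = apply_patch_checked(upstream, *patch)
assert not ok                       # checked variant fails loudly
```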
After fixing that, the tool-prefill OOM disappeared and higher context configs became usable.
Fix:
https://github.com/Sandermage/genesis-vllm-patches (PR #13)
---
### What we’re optimizing for
The goal here isn’t just max TPS or max context in isolation, but pushing both together on a single 3090:
- high context (200K+)
- usable throughput
- stable tool-agent workloads
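For orientation, the general shape of such a config in vLLM's offline API looks something like the sketch below. Every value is an illustrative assumption, not the measured setup; the real config lives in the repro repo linked further down.

```python
# Rough shape of a single-GPU long-context vLLM config. All values are
# assumptions for illustration (see the repro repo for the actual setup).
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/quantized-27b",   # placeholder: a quantized checkpoint
    max_model_len=218_000,           # the context target discussed above
    gpu_memory_utilization=0.95,     # use most of the 24 GB, keep headroom
    kv_cache_dtype="fp8",            # assumption: quantized KV cache to fit
)

out = llm.generate(["ping"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```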
---
### Notes / limitations
- There is still a second memory cliff around ~50–60K context for single-prompt workloads on 1 GPU (a crude way to locate it on your own setup is sketched after this list)
- That one doesn’t apply with tensor parallelism (e.g. 2× 3090)
- Results depend heavily on quantization + config
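If you want to find the cliff on your own hardware, one crude probe is to binary-search the largest prompt that still completes. A minimal sketch, assuming a local vLLM server exposing the OpenAI-compatible API on localhost:8000 (the model name and search bounds are placeholders):

```python
# Binary-search for the largest prompt (approx. token count) that still
# completes. Assumes a vLLM OpenAI-compatible server on localhost:8000.
import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "your-model-name"  # placeholder: whatever you serve

def fits(n_tokens: int) -> bool:
    prompt = "word " * n_tokens  # ~1 token per repetition; crude but enough
    try:
        r = requests.post(
            URL,
            json={"model": MODEL, "prompt": prompt, "max_tokens": 8},
            timeout=600,
        )
        return r.status_code == 200  # OOM/length errors come back non-200
    except requests.RequestException:
        return False  # a hard server crash also counts as "doesn't fit"

lo, hi = 1_000, 220_000  # known-good / known-bad starting bounds
while hi - lo > 1_000:
    mid = (lo + hi) // 2
    if fits(mid):
        lo = mid
    else:
        hi = mid
print(f"cliff somewhere between ~{lo:,} and ~{hi:,} prompt tokens")
```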
---
### Repro
https://github.com/noonghunna/club-3090
---
Curious how others are balancing context vs TPS on 3090/4090 setups.