Following up on our previous post about running Qwen3.6-27B on a single RTX 3090 (~125K context, higher TPS).
We’ve been pushing further on both context length and stability for tool-agent workloads.
Current results:
- ~218K context @ ~50 / ~66 TPS (text-only; narrative / code workloads)
- ~198K context + vision @ ~51 / ~68 TPS
- tool calls with ~25K-token outputs now complete without OOM
So: lower TPS than our earlier config, but significantly higher context and better stability under real workloads.
---
### What changed
Previously, long tool outputs (~25K tokens) would consistently OOM during prefill.
This turned out to be related to a Genesis patch (PN12) that was supposed to mitigate a memory issue but wasn't actually being applied on vLLM dev205+:
- `apply_all` reported success
- but the underlying code path was unchanged
Root cause: anchor drift in the patch. The upstream code the patch anchored on had changed in dev205+, so the text replacement silently matched nothing.
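To make the failure mode concrete, here's a minimal sketch of how an anchor-based patcher can report success while changing nothing. Names, anchors, and the checked variant are all made up for illustration; this is not the Genesis code:

```python
# Hypothetical sketch (not the actual Genesis patcher API): an anchor-based
# text patch that reports success even when its anchor no longer matches.

def apply_patch(src: str, anchor: str, replacement: str) -> tuple[str, bool]:
    # Bug: str.replace() is a silent no-op when the anchor is absent,
    # so the patch "applies" but the underlying code path is unchanged.
    return src.replace(anchor, replacement), True

def apply_patch_checked(src: str, anchor: str, replacement: str) -> tuple[str, bool]:
    # Fix: treat a missing anchor as a hard failure instead of success.
    if anchor not in src:
        return src, False
    return src.replace(anchor, replacement), True

# The upstream line drifted, so the stale anchor no longer matches:
upstream = "kv = alloc_kv_cache(seq_len)\n"          # current upstream code
patch = ("kv = allocate_kv_cache(seq_len)",          # stale anchor
         "kv = allocate_kv_cache(min(seq_len, cap))")
patched, ok = apply_patch(upstream, *patch)
assert ok and patched == upstream   # reported success, changed nothing
checked, ok = apply_patch_checked(upstream, *patch)
assert not ok                       # checked variant fails loudly
```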
After fixing that, the tool-prefill OOM disappeared and higher context configs became usable.
Fix:
https://github.com/Sandermage/genesis-vllm-patches (PR #13)
---
### What we’re optimizing for
The goal here isn’t just max TPS or max context in isolation, but pushing both together on a single 3090:
- high context (200K+)
- usable throughput
- stable tool-agent workloads
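For orientation, the general shape of such a config in vLLM's offline API looks something like the sketch below. Every value is an illustrative assumption, not the measured setup; the real config lives in the repro repo linked further down.

```python
# Rough shape of a single-GPU long-context vLLM config. All values are
# assumptions for illustration (see the repro repo for the actual setup).
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/quantized-27b",   # placeholder: a quantized checkpoint
    max_model_len=218_000,           # the context target discussed above
    gpu_memory_utilization=0.95,     # use most of the 24 GB, keep headroom
    kv_cache_dtype="fp8",            # assumption: quantized KV cache to fit
)

out = llm.generate(["ping"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```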
---
### Notes / limitations
- There is still a second memory cliff around ~50–60K context for single-prompt workloads on 1 GPU (a crude way to locate it on your own setup is sketched after this list)
- That one doesn’t apply with tensor parallelism (e.g. 2× 3090)
- Results depend heavily on quantization + config
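If you want to find the cliff on your own hardware, one crude probe is to binary-search the largest prompt that still completes. A minimal sketch, assuming a local vLLM server exposing the OpenAI-compatible API on localhost:8000 (the model name and search bounds are placeholders):

```python
# Binary-search for the largest prompt (approx. token count) that still
# completes. Assumes a vLLM OpenAI-compatible server on localhost:8000.
import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "your-model-name"  # placeholder: whatever you serve

def fits(n_tokens: int) -> bool:
    prompt = "word " * n_tokens  # ~1 token per repetition; crude but enough
    try:
        r = requests.post(
            URL,
            json={"model": MODEL, "prompt": prompt, "max_tokens": 8},
            timeout=600,
        )
        return r.status_code == 200  # OOM/length errors come back non-200
    except requests.RequestException:
        return False  # a hard server crash also counts as "doesn't fit"

lo, hi = 1_000, 220_000  # known-good / known-bad starting bounds
while hi - lo > 1_000:
    mid = (lo + hi) // 2
    if fits(mid):
        lo = mid
    else:
        hi = mid
print(f"cliff somewhere between ~{lo:,} and ~{hi:,} prompt tokens")
```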
---
### Repro
https://github.com/noonghunna/club-3090
---
Curious how others are balancing context vs TPS on 3090/4090 setups.