Long-context coding on RTX 5080 16GB: Qwen3.6-35B-A3B holds 30 t/s at 128K (89 t/s fresh), no quality drop

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author benchmarks Qwen3.6-35B-A3B for long-context coding with a local Claude Code-style workflow on an RTX 5080 16GB GPU, aiming to avoid reliance on paid hosted tools.
  • Using a llama.cpp-based local server (via an Anthropic-compatible /v1/messages endpoint), they report sustained throughput of about 30 tokens per second at 128K context, with roughly 89 tokens per second under fresh prompt conditions.
  • The test is presented as a practical field log rather than a leaderboard, focusing on whether long-context code-agent usage remains workable on a single consumer GPU.
  • They document hardware and environment specifics (Windows 11, 96GB DDR5, CUDA 12.9.1 requirement, and compute-only RTX 5080) and describe why certain CUDA versions caused incorrect outputs or crashes.
  • For performance, they work from a TurboQuant CUDA fork and use Nsight Compute profiling to identify register-bound behavior in a kernel, then apply many small compilation/inlining/vectorization changes that cumulatively improve deep long-context speed without reported quality regressions.

I wanted to see how much of my coding-agent workflow I could move local instead of paying for hosted tools forever.

There was another push: Anthropic's own April 23 postmortem confirmed product-layer regressions through March/April. With a local model, what you benchmark is what you get.

The other constraint was context. I needed something that stayed usable at 65K–128K minimum.

I had an RTX 5080 16GB sitting idle most of the day. Qwen3.6 had been getting enough praise for coding that it seemed worth testing seriously. Claude Code can be pointed at a local Anthropic-compatible /v1/messages endpoint (Unsloth has a good guide on this), so the goal was simple: keep the Claude Code workflow, but serve the model from local llama.cpp.

This is not a leaderboard benchmark. It is a field log from trying to make long-context coding-agent work usable on one consumer GPU.

Hardware

  • RTX 5080 16GB (sm_120, consumer Blackwell GB203)
  • Ryzen 9700X (8c/16t)
  • 96GB DDR5
  • Windows 11
  • iGPU drives the display, 5080 is compute-only
  • PCIe Gen 5 x16

One important note: CUDA 12.9.1 is mandatory on the fork I ended up using. CUDA 13.x produces garbage output and 13.1 segfaults in MMQ kernels. Learned that the hard way.

The fork

Not running mainline llama.cpp. I started with Madreag/turbo3-cuda (a TurboQuant CUDA fork in the TheTom/llama-cpp-turboquant lineage; TurboQuant adds TCQ, Trellis-Coded Quantization, for the KV cache at ~3.125 bits per value). My patched fork is here: craftogrammer/llama.cpp-adaptive-turboquant.

It worked fine at lower context around 64K, but speed dropped off hard at the longer contexts I was targeting, and I wanted to understand why. So I profiled decode with ncu (Nsight Compute) on the dense 27B at d=65K. mul_mat_q<IQ3_S> ate 43% of profiled decode time. Digging deeper: 254 registers per thread, ~12.5% theoretical occupancy, DRAM throughput under 7%. The kernel is register-bound, not memory-bound, so cp.async, prefetch, and pipelining tricks don't help.

I tried two committed kernel changes (backtrace to shared memory, alignment fix) plus one local experiment (cp.async for MMQ tile loads, tested and reverted), clean-rebenching each: +0.16% combined. Null. A series of smaller inlining and vectorization wins (V-dequant inline, byte-pair vectorization, minBlocks bump, inline scorer) did compound: +0.7% at d=0, scaling to +13% at d=64K. Individually small, but meaningful stacked at depth.
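
For reference, the profiling run above was a fairly standard ncu invocation. This is only a sketch: the model file name, output name, and kernel filter are illustrative, and it assumes a llama-bench build that supports -d/--n-depth.

```bash
# Profile decode kernels at depth 65536; skip warm-up launches so the report
# reflects steady-state decode. Kernel filter and file names are illustrative.
ncu --set full \
    --kernel-name regex:mul_mat_q \
    --launch-skip 200 --launch-count 20 \
    --target-processes all \
    -o mmq_iq3s_d65k \
    ./llama-bench -m qwen3.6-27b-neo-code-iq3_m.gguf -ngl 99 -fa 1 -n 32 -d 65536
```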

I also tested two ideas that I measured and rejected: a think-anchor mechanism (fp16 sink ranges anchored on reasoning tokens — measured −0.28% TG vs disabled, declined to ship) and a sparse-V threshold runtime knob (measured −32% decode regression, 20.4 vs 29.8 t/s, reverted to upstream-validated constant). Mentioning these because they took real time and the negative results are part of the honest picture.

Along the way I hit sm_120 ptxas issues: had to back off occupancy hints on FA vec kernels (higher minBlocks crashed the compiler). Some TCQ helpers must stay __noinline__, certain TUs need --ptxas-options=-O0. One thing easy to miss: prefetch.global.L2 lowers to CCTL.E.PF2 in SASS on sm_120 — grep for CCTL, not PRF.
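
A quick way to check what the prefetch actually compiled to (the object file path is illustrative; point it at whatever your build produced for the MMQ translation unit):

```bash
# On sm_120, prefetch.global.L2 shows up as CCTL.E.PF2 in the SASS,
# so search the disassembly for CCTL rather than PRF.
cuobjdump -sass ./ggml/src/ggml-cuda/mmq.cu.o | grep -n "CCTL"
```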

Building on those findings, I patched the fork with adaptive KV mode selection, MoE offload tuning, and tight-VRAM fixes for the RTX 5080 16GB.

First attempt: Qwen3.6-27B dense

This model looked like the natural fit for 16GB. Hybrid Transformer-Mamba, only 16/64 layers carry KV cache. Memory math looked fine on paper.

And at low context, it was fine. 40 t/s at empty context on a NEO-CODE IQ3_M quant. Usable.

Then I ran a depth sweep to see what actually happens as context grows:

Context depth | Decode (t/s)
0 | 40.5
16K | 17.4
32K | 10.6
65K | 6.0
128K | 3.2

3.2 tokens per second at 128K. In practice, Claude Code just felt painfully slow once conversations got long. Running the depth bench afterward explained why — the curve matched exactly what I was experiencing.
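
The depth bench itself is just llama-bench run at increasing -d. A minimal sketch, assuming your build has the -d/--n-depth flag (model path illustrative):

```bash
# Decode speed (tg32) at increasing context depth, instead of only d=0.
for d in 0 16384 32768 65536 131072; do
  ./llama-bench -m qwen3.6-27b-neo-code-iq3_m.gguf -ngl 99 -fa 1 -n 32 -d "$d"
done
```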

I spent days trying to tune this. Swept 9 combinations of ubatch size and thread count. The spread across all 9 was 0.46 t/s. Decode was completely bandwidth-locked. There was nothing to tune.

IQ3_M wasn't a quality choice — it was the only option that fit. Here's what the quant landscape looks like on 16GB at 131K context:

Quant | File size | Fits at 131K?
NEO-CODE IQ3_M (by DavidAU) | 12.0 GiB | yes
UD-Q3_K_XL | 13.5 GiB | yes (tight)
IQ4_XS | 14.3 GiB | no (~1.6 GiB over)
Q4_K_S | 14.8 GiB | no
IQ4_NL | 15.0 GiB | no
Q4_K_M | 15.7 GiB | no
Q5 / Q6 | 19+ GiB | no (5090 territory)

Every Q4-class quant and above is out of reach on dense 27B + 16GB at usable context. IQ4_XS would need ~7 layers offloaded to CPU, which kills decode to ~5 t/s — defeats the purpose. So I was stuck at IQ3_M quality with a depth curve that made agent loops painful.

What finally pushed me to try the MoE path was a concrete coding test. I gave both models a restaurant bill splitter (integer paisa, exact-sum invariant, 4 test cases). The dense 27B wrote personSubtitles instead of personSubtotals three times — code doesn't even run. The 35B-A3B MoE wrote clean BigInt code that passed all 4 tests, in less wall time despite generating 54% more tokens. That was the moment I stopped trying to save the dense path.

Why I tested the MoE path

Can a model that doesn't fully fit on 16GB still be useful for long-context coding if you offload some experts to system RAM?

That is the regime I had not seen enough numbers for: consumer Blackwell, one 16GB GPU, long coding-agent context, partial MoE offload. So I tested it end-to-end instead of treating "35B total" as an automatic no.

Context depth | 27B dense (old path), t/s | 35B-A3B MoE (final path), t/s
0 | 40.5 | 91.8
16K | 17.4 | 76.9
32K | 10.6 | 54.1
65K | 6.0 | 46.2
128K | 3.2 | 30.4

Not a controlled single-variable comparison — I changed model, quant, offload split, and KV layout. The point is practical: dense wasn't usable at agent context, MoE became usable after tuning.

The offload balance is the whole game

On the UD-Q4_K_XL GGUF (20.81 GiB), the ncmoe sweep at d=16K:

ncmoe | tg32 (t/s) | Notes
40 (all CPU) | 36.4 | baseline
20 | 53.2 |
16 | 58.9 | sweet spot for this file
12 | 36.1 | hit VRAM cliff
8 | 5.9 | catastrophic spill

The cliff is sharp. Sweet spot depends on GGUF file size vs available VRAM after KV allocation.

APEX-I-Compact (credit: mudler on Hugging Face) won because its smaller file (16.1 GiB vs 20.8 GiB) let me use ncmoe=8 instead of 16. That reduced PCIe pressure enough to matter:

Context depth | UD-Q4_K_XL (ncmoe=16), t/s | APEX-I-Compact (ncmoe=8), t/s
0 | 51.6 | 92.3
16K | 58.9 | 75.9
32K | 49.3 | 64.2
65K | 39.4 | 48.0
128K | 31.3 |

I also tested APEX-I-Quality (Q6_K, 21.25 GiB). It needed ncmoe=20 just to avoid VRAM thrashing. At that offload level it was the same speed as UD with the same quality on my shared test harness. No axis where it beat either keeper. Deleted it.

My coding benchmark was wrong (and yours might be too)

I initially thought UD was clearly better quality: 33/34 tests passed vs APEX-I-Compact's 29/32. A 6.5 percentage point gap.

Then I looked at what was actually happening. Each model was writing its own test suite AND its own implementation. A model that wrote 19 tests including 4 broken ones scored 15/19, while a model that wrote 11 clean tests scored 11/11. The benchmark was grading (implementation quality × test quality) and calling it implementation quality.

Specific bugs I found:

  • APEX-I-Compact had a real impl bug: b.priority was undefined because the subscription stored it as options.priority. Sort comparator returned NaN, no sorting happened.
  • APEX-I-Quality wrote 4 tests where a no-op handler was supposed to populate an array that was declared after the handler was removed. The tests were broken, not the implementation.
  • My prompt had a contradictory clause about snapshot-during-emit semantics that each model interpreted differently but consistently.

After fixing the prompt, pinning sampling to deterministic (temp=0, seed=42), and grading all three against a single shared 11-test harness:

Model | Decode (t/s) | Shared harness
UD-Q4_K_XL | 64.5 | 11/11
APEX-I-Compact | 86.7 | 11/11
APEX-I-Quality | 53.4 | 11/11

The quality gap disappeared. The speed gap didn't.

If you're doing local coding evals: use a shared test harness, pin your sampling, and disambiguate your prompts. Self-written tests are not a quality signal.
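
For the pinning part, request-level knobs are enough. A minimal sketch against the local llama-server's OpenAI-compatible endpoint (the prompt content is illustrative):

```bash
# Deterministic sampling: temperature 0 plus a fixed seed, so reruns of the
# same prompt produce the same output for the shared harness to grade.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Implement the event emitter described in SPEC.md"}],
        "temperature": 0,
        "seed": 42,
        "max_tokens": 4096
      }'
```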

The "compress everything" trap

One finding from my setup that may be worth testing elsewhere: more KV compression is not always faster at long context.

I tested different KV cache layouts on the fork — ranging from "compress all attention layers with TCQ" to "promote some K+V layers to q8_0." I'm intentionally not posting the exact mode map here because this is fork-specific and still changing. But the shape of the result:

KV layout | d=0 | d=16K | d=32K | d=65K | d=128K (all decode t/s)
All compressed | 86.8 | 55.2 | 42.3 | 28.3 | 16.6
Hybrid (some layers q8_0) | 91.8 | 76.9 | 54.1 | 46.2 | 30.4

At 128K, the hybrid layout is nearly 2x faster than full compression.

I don't have a proven explanation for why. My working hypothesis is that TCQ codebook lookup overhead grows linearly with K reads, and at deeper context you're paying more per-read cost. Promoting the most-accessed layers to q8_0 avoids that where it matters most. Whatever the cause, the measured result is clear: if you're running any TCQ or compressed KV scheme, test at your actual working context depth, not d=0.

To avoid manually picking a layout, I wrote an auto-selector: at cache allocation it probes free VRAM via ggml_backend_dev_memory, estimates each layout's KV size with the same ggml_row_size formula the allocator uses, and picks the most aggressive mode that fits under free VRAM minus a 1 GiB compute-peak margin. Verified: predicted 1510 MiB, actual allocation 1509.88 MiB. On bigger cards it stays aggressive; on tight VRAM it falls back automatically. Override with TURBO_LAYER_ADAPTIVE=N if you want manual control.

Where it is now

Daily driver config:

  • Model: Qwen3.6-35B-A3B APEX-I-Compact (16.10 GiB)
  • Fork: craftogrammer/llama.cpp-adaptive-turboquant, CUDA 12.9.1, sm_120
  • Offload: 8 expert layers on CPU (--n-cpu-moe 8)
  • Context: 131072 (128K)
  • KV: turbo3_tcq with auto-selected hybrid layout
  • Sampling: temp=0.6, top_p=0.95, top_k=20

Claude Code talks to this through ANTHROPIC_BASE_URL=http://127.0.0.1:8080. Server-side log from one real request: 1078-token prompt prefilled at 1582 t/s, 538-token decode at 90.7 t/s.
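
Roughly, the launch and hookup look like this. The model path is illustrative, I'm omitting the fork-specific flag that selects the turbo3_tcq KV mode (the auto-selector handles it by default on my fork), and your Claude Code version may want additional auth variables:

```bash
# Serve the daily-driver config (model path illustrative; fork-specific
# KV-mode flag omitted, auto-selection picks the hybrid layout).
./llama-server \
  -m Qwen3.6-35B-A3B-APEX-I-Compact.gguf \
  -c 131072 -ngl 99 --n-cpu-moe 8 \
  --cache-ram -1 \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --host 127.0.0.1 --port 8080

# Point Claude Code at the local server instead of the hosted API.
export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
claude
```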

VRAM sits at ~13.3 / 16.0 GB during sustained 128K decode. Tight but no spill.

Prompt cache (--cache-ram -1) makes agent loops much faster after the first turn: cold prefill of a 23K-token prompt takes ~13s at 1787 t/s, but subsequent turns with similar prefix only re-prefill the delta at 419–569 t/s. One gotcha on hybrid Mamba+Attention: any prefix mismatch — even a dynamic timestamp or request ID — forces full re-prefill because the SSM state can't partially roll back.

Fallback if I hit real-world regressions: UD-Q4_K_XL at ncmoe=16, ~62 t/s, same quality on shared harness.

The ceiling is the hardware

PCIe Gen 5 x16 hits ~89% saturation during MoE decode (56–61 GB/s burst against a ~63 GB/s theoretical ceiling). SM utilization sits at 93–97%. I don't see obvious tuning headroom left in this regime.

39–48 t/s at d=65K and ~30 t/s at d=128K is what this hardware does. Getting past 50 t/s sustained at long context needs more VRAM (fewer experts on CPU = less PCIe traffic), not more clever kernels. Waiting for 5090 at MSRP whenever that happens.

If you want to try this on your 16GB card

The short version: grab Qwen3.6-35B-A3B in a ~16 GiB GGUF (APEX-I-Compact worked for me) and sweep ncmoe at your target context depth — not at d=0. The sweet spot is narrow and file-size-dependent. On my 5080 it's ncmoe=8 for the 16 GiB file and ncmoe=16 for the 21 GiB file.
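
A sketch of that sweep, assuming the llama-bench in your build exposes both --n-cpu-moe and -d/--n-depth (model path and depth are illustrative; use your real working depth):

```bash
# Sweep expert-offload counts at the context depth you actually work at,
# not at d=0; the sweet spot shifts with GGUF size and free VRAM.
for ncmoe in 24 20 16 12 8 4; do
  ./llama-bench -m Qwen3.6-35B-A3B-APEX-I-Compact.gguf \
    -ngl 99 -fa 1 -n 32 -d 65536 --n-cpu-moe "$ncmoe"
done
```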

If you're using a TurboQuant-derived fork with compressed KV, test at your real working depth. I found that full compression was nearly 2x slower than a hybrid layout at 128K — d=0 benchmarks won't tell you that.

One thing worth preempting since it just landed: I benched mainline NVFP4 (b8967) the same day it shipped and got 15–16 t/s, versus 39–51 t/s on the fork in the same MoE+offload regime. GitHub #18250 was closed as "not planned."

What I learned

Measure at your actual working context depth, not d=0. Agent context grows fast and d=0 speed is not predictive. The depth curve is the hardware talking — I spent days trying to tune around it before accepting it was a PCIe ceiling, not a configuration problem.

On 16GB, file size matters more than quant quality. A smaller GGUF that lets you keep more experts on GPU will beat a "better" quant that forces worse offload balance. Quality was identical on shared deterministic tests.

And if you're running local coding evals: use a shared test harness, pin your sampling, and disambiguate your prompts. I thought one model was 6.5pp better until I realized each model was grading itself on its own self-written tests. The gap disappeared the moment I used a shared harness.


English isn't my first language — I used Claude to help write this post. All data, measurements, benchmarks, and technical conclusions are from my own testing on my own hardware.

submitted by /u/craftogrammer