GPU-Poor with ~12 GB VRAM and a 3080 getting 40 tg/s on Qwen3.6 35B A3B w/ 260k ctx

Reddit r/LocalLLaMA / 4/17/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The post reports that TheTom’s GPU-accelerated turboquant (turbo3) enables high-context performance for the Qwen3.6 35B A3B family, reaching ~40 tok/s on a 3080.
  • It provides a specific llama-server build and runtime configuration using CUDA with multiple quantization-related flags (including FA_ALL_QUANTS and CUDA F16) and turbo3 KV cache settings.
  • The setup targets a ~260k context window (stated as “260k ctx”) and loads the Qwen3.6 35B A3B GGUF with UD-Q4_K_M quantization.
  • The author notes using “reasoning off” and Qwen-recommended sampling/serving settings to improve time-to-first-acceptable-solution with a staged prompt harness (ask → validate → review → refine/accept).
  • Overall, the article functions as a practical tuning recipe for running long-context Qwen3.6 models on relatively limited GPU VRAM.

TheTom's GPU-accelerated turboquant (turbo3) has unlocked high-context gains for the 35B A3B family.

I can now achieve ~40tg/s via the following GPU-POOR compilation flags and configuration:

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_FORCE_MMQ=ON

./local/bin/llama-cpp-turboquant/llama-server \
  --alias 'Qwen3-6-35B-A3B-turbo' \
  --ctx-size 0 \
  --fit on \
  --no-mmproj \
  --jinja \
  --flash-attn on \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --reasoning off \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0
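Once the server is up, it can be queried through llama-server's OpenAI-compatible chat endpoint. A minimal client sketch, assuming the server is on its default port 8080 and mirroring the sampling flags from the command above (the prompt text and URL are illustrative, not from the post):

```python
import json
import urllib.request

def build_request(prompt: str,
                  url: str = "http://localhost:8080/v1/chat/completions"):
    # Mirror the sampling flags passed to llama-server on the client side,
    # so the request is explicit even if server defaults change.
    payload = {
        "model": "Qwen3-6-35B-A3B-turbo",   # matches the --alias flag
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To actually send it (requires the server to be running):
# with urllib.request.urlopen(build_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```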

This uses the Qwen3.6 recommended sampling settings for thinking off, as I find the time-to-first-acceptable-solution is better with a prompt harness that has stages: ask, validate, review, refine/accept.
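The staged harness could be sketched as a small loop; this is a hypothetical reconstruction of the ask → validate → review → refine/accept flow, where `call_model` is a stand-in for any client of the local llama-server (none of these names are from the original post):

```python
def staged_solve(task, call_model, max_rounds=3):
    """Staged prompt harness: ask, then validate/review/refine until accepted."""
    # ask: get a direct first draft (reasoning is off on the server side)
    draft = call_model(f"Task:\n{task}\nAnswer directly and concisely.")
    for _ in range(max_rounds):
        # validate: have the model judge its own draft
        verdict = call_model(
            f"Validate this answer to the task.\nTask:\n{task}\n"
            f"Answer:\n{draft}\nReply 'OK' or list concrete defects."
        )
        if verdict.strip().upper().startswith("OK"):
            return draft                       # accept
        # review: turn defects into actionable fixes
        review = call_model(f"Given these defects:\n{verdict}\nSuggest fixes.")
        # refine: produce an improved draft
        draft = call_model(
            f"Rewrite the answer applying these fixes:\n{review}\n"
            f"Previous answer:\n{draft}"
        )
    return draft  # best effort after max_rounds
```

Because `call_model` is injected, the harness works with any backend (a `urllib` call to the server, an OpenAI-style SDK, or a stub for testing).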

submitted by /u/herpnderpler