Please share your best config <3
**My setup** — Windows, 2×3080 (20 GB VRAM), 256 GB DDR4 RAM, llama.cpp. With 100K of context filled I get 400/11 pp/tg:

    "A:/0_llama_server/llama-server.exe" -m "a:\0_LM_Studio\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q5_K_XL.gguf" --port 8080 --alias qwen3.5:27b -ngl 999 --threads 22 --flash-attn on --host 0.0.0.0 --no-mmap -mg 1 --batch-size 1024 --ubatch-size 512 --ctx-checkpoints 128 --ctx-size 196610 --reasoning on --jinja --draft-max 128 --spec-ngram-size-n 48 --draft-min 2 --spec-type ngram-mod --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.0 --presence-penalty 0.0 --chat-template-kwargs "{\"preserve_thinking\":true}" --tensor-split 0.46,0.54

**DGX (user Impossible_Art9151):**
    llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL --host 0.0.0.0 --port 8095 --ctx-size 512000 --no-mmap --parallel 2 --flash-attn on --n-gpu-layers 999 --chat-template-kwargs '{"preserve_thinking":true}' --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.0 --presence-penalty 0.0

**7900 XTX, 24 GB VRAM (user soyalemujica)** — 35 t/s tg, ~400 pp; 27 t/s at 160K context:
    llama-server.exe -ctv q8_0 -ctk q8_0 -c 160000 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0 --fit on

**UPDATE (my setup):** Tested turboquant3 and turboquant4 in the dual-GPU setup; unfortunately they were slower. Start->End (prompting it to analyze a codebase).
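Since `--tensor-split 0.46,0.54` shows up in the dual-GPU config above: it distributes the model's layers across GPUs in proportion to the given ratios. A minimal Python sketch of proportional layer assignment — my own illustration with a hypothetical `split_layers` helper and an assumed 48-layer model, not llama.cpp's actual allocation code:

```python
def split_layers(n_layers: int, ratios: list[float]) -> list[int]:
    """Assign n_layers across devices proportionally to ratios.

    Illustrative only: floors each share, then hands leftover layers
    to the later devices until everything is placed.
    """
    total = sum(ratios)
    counts = [int(n_layers * r / total) for r in ratios]
    i = len(counts) - 1
    while sum(counts) < n_layers:  # distribute the rounding remainder
        counts[i] += 1
        i = (i - 1) % len(counts)
    return counts

# Assumed 48-layer model split 0.46/0.54 across two GPUs:
print(split_layers(48, [0.46, 0.54]))  # → [22, 26]
```

In practice the ratios are worth tuning empirically (as the 0.46/0.54 split above suggests), since GPU 0 often carries extra buffers on top of its layer share.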
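On why `-ctk q8_0 -ctv q8_0` matters at 160K context: the KV cache grows linearly with context length, and q8_0 stores roughly 1.0625 bytes per value (34 bytes per 32-value block) versus 2 bytes for f16. A back-of-the-envelope estimate — the architecture numbers (48 layers, 8 KV heads, head dim 128) are illustrative assumptions, not the real model's:

```python
def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: float) -> float:
    """Estimate KV-cache size in GiB: K and V each hold
    n_kv_heads * head_dim values per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

ctx = 160_000
f16  = kv_cache_gib(ctx, 48, 8, 128, 2.0)     # default f16 cache
q8_0 = kv_cache_gib(ctx, 48, 8, 128, 1.0625)  # q8_0: ~1.0625 bytes/value
print(f"f16: {f16:.1f} GiB, q8_0: {q8_0:.1f} GiB")
```

With these assumed dimensions the quantized cache saves roughly 45% of the KV memory, which is what makes a 160K context fit on a single 24 GB card at all.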
