Hi folks,

Enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general-purpose use; they also solve puzzles correctly more often. The initial intent was to optimise the 35B-A3B reasoning traces, since it's the most efficient model on my 5090 setup: I can run parallel jobs with llama.cpp on my prod. I love the 27B's consistency, but its prefill churn on long-horizon work is painful.

I tweaked the GBNF and ran a basic prompt through my custom Rust/Next.js bench to check for improvements, and I have to say 35B-A3B had the nicest uplift. The full test set: a simple "Hi" prompt, a puzzle, and my custom Rust/Next.js bench (60-task suite). Ironically, I included the "Hi" prompt because the community rightfully complained about reasoning drag on simple things with the 35B-A3B.

Tested Specs

RTX 5090 on Fedora 43, llama.cpp mainline (April 24), with the GGUF builds named below.

Total Score + Finish Time are the keys for the chart; accuracy per memory is personal reference. Qwen3.6 35B-A3B moves from X6 -> X1 as chart leader, with a massive time reduction and a score bump. Qwen3.6 27B moves from X4 -> X3 on better finishing time; its score holds.

Total throughput recorded throughout the benchmark:
- Qwen3.6 35B-A3B APEX I-Balanced: 1844 -> 2195 t/s
- Qwen3.6 27B Uncensored HauHauCS Aggressive Q6_K_P: 1067 -> 1193 t/s

The Rust/Next.js bench is script-injected sequentially with OpenCode and runs against a prod repo for financial applications, so it's not publicly shared.

Puzzle Prompt

It's worth noting that 35B-A3B struggled immensely with this puzzle: it would occasionally loop towards the end of the CoT or give incorrect answers. Since a run took me 12 s instead of 2+ min, it was easy to retry until I got correct answers. The answer should be NO VALID ROUTE EXISTS. The models really churn through this one.

GBNF Grammar

I've only noticed some thinking tags leaking outside the CoT on Open WebUI. Beyond that, it works on Hermes, llama.cpp's WebUI, and OpenCode without issue. I didn't have more time to spend on my prod (past sleep time), so I hope this gives your setup a boost.
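The grammar file itself isn't shared in the post, so below is a minimal sketch of the kind of GBNF constraint it describes, assuming the goal is to bound the <think> block so reasoning traces can't run away. The rule names and the length cap are illustrative assumptions, not the author's actual grammar, and the bounded-repetition syntax ({m,n}) requires a reasonably recent llama.cpp:

```gbnf
# Hypothetical sketch, not the author's grammar: allow exactly one
# <think> block, cap its length, then require a plain answer.
root   ::= "<think>" think "</think>" "\n" answer

# Cap the reasoning trace; excluding "<" also keeps stray tags out
# of the CoT. The 4096-char bound is an illustrative number, not a
# tested value.
think  ::= [^<]{0,4096}

# The visible answer: any run of characters that never opens a tag.
answer ::= [^<]*
```

A cap this blunt trades accuracy for speed on hard prompts, which fits the post's observation that retrying the fast 35B-A3B beats waiting on long traces.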
GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B
Reddit r/LocalLLaMA / 4/28/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- A user reports an optimization (“tweaked the GBNF”) that speeds up inference for Qwen3.6 35B-A3B and Qwen3.6 27B while also improving correctness on puzzles; a grammar-loading sketch follows this list.
- In testing with an RTX 5090 on Fedora 43, using llama.cpp mainline (April 24) and specific GGUF builds, the 27B model saw large reductions in output tokens and puzzle latency when the grammar was applied.
- The “Hi” prompt was used intentionally to reflect community feedback about reasoning drag on simple prompts, and results are framed around reducing that overhead.
- Bench results indicate that the 27B model’s coding/general performance score held steady (“score maintains”) while generation time improved, suggesting the grammar change mainly affects efficiency.
- The author’s main motivation was to optimize 35B-A3B reasoning traces for faster parallel workloads in llama.cpp; the 35B-A3B uplift turned out better than expected.
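For readers who want to reproduce this kind of test, here is a hedged sketch of loading a grammar file through the llama-cpp-python bindings (the post drives llama.cpp directly and via OpenCode instead; the model path and grammar filename below are placeholders, not the author's files):

```python
from llama_cpp import Llama, LlamaGrammar

# Placeholder paths: substitute your own GGUF build and grammar file.
llm = Llama(model_path="qwen3.6-35b-a3b-apex.gguf", n_ctx=8192)
grammar = LlamaGrammar.from_file("think-cap.gbnf")

# The "Hi" prompt from the post: a cheap probe for reasoning drag.
out = llm(
    "Hi",
    grammar=grammar,  # constrain decoding to the GBNF rules
    max_tokens=512,
)
print(out["choices"][0]["text"])
```

The same grammar file can also be passed to llama.cpp's CLI via --grammar-file for quick one-off checks.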
Related Articles
- Black Hat USA (AI Business)
- China’s DeepSeek prices new V4 AI model at 97% below OpenAI’s GPT-5.5 (SCMP Tech)
- I built Dispatch AI. I just wanted to share it. If you find it cool, take a look and leave a comment. (Dev.to)
- Replit AI Agent: Practical Guide for Dev Workflows (Dev.to)
- Open source Xiaomi MiMo-V2.5 and V2.5-Pro are among the most efficient (and affordable) at agentic 'claw' tasks (VentureBeat)