GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A user reports an optimization (“tweaked the GBNF”) that speeds up inference for Qwen3.6 35B-A3B and Qwen3.6 27B while also improving correctness on puzzles.
  • In testing with RTX 5090 on Fedora 43 using llama.cpp mainline (April 24) and specific GGUF builds, the 27B model saw large reductions in output tokens and puzzle latency when using the grammar.
  • The “Hi” prompt was used intentionally to reflect community feedback about reasoning drag on simple prompts, and results are framed around reducing that overhead.
  • Bench results indicate that the 27B model’s coding/general performance score stayed the same (“same score”) while generation time improved, suggesting the grammar change mainly affects efficiency.
  • The author’s main motivation was to optimize 35B-A3B reasoning traces for faster parallel workloads in llama.cpp, citing better uplift for 35B-A3B than expected.

Hi folks,

Enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general-purpose work - they also solve puzzles correctly more often.

The initial intent was to optimise the 35B-A3B reasoning traces, since it's the most efficient model on my 5090 setup: I can run parallel jobs with llama.cpp in prod.

I love the 27B's consistency, but the prefill churn on long-horizon work is painful.

I tweaked the GBNF and ran it against my custom Rust/Next.js bench to measure the improvements, and I have to say 35B-A3B had the nicest uplift.

I tested a simple "Hi" prompt, a puzzle, and my custom Rust/Next.js bench (a 60-task suite).

Ironically, I used the "Hi" prompt because the community rightly complained about reasoning drag on simple prompts with the 35B-A3B.

Tested Specs
- RTX 5090
- Fedora 43
- llama.cpp mainline April 24th
- Qwen3.6-35B-A3B-APEX-I-Balanced.gguf (-c 216k)
- Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf (-c 114k)
- kv f16
- -b & -ub 256
- Qwen's recommended sampling settings for reasoning + coding
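For reference, a sketch of how those specs might map onto a llama.cpp invocation (not the author's exact command — the grammar filename is a placeholder, `-c 221184` assumes "216k" means 216×1024, and flag spellings can vary across llama.cpp versions):

```shell
# Hypothetical llama-cli invocation matching the listed specs.
# think-header.gbnf is the grammar from this post, saved to disk.
./llama-cli \
  -m Qwen3.6-35B-A3B-APEX-I-Balanced.gguf \
  -c 221184 -b 256 -ub 256 \
  --grammar-file think-header.gbnf \
  -p "Hi"
```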

| Model | Test | Without grammar | With grammar | Improvement |
|---|---|---|---|---|
| Qwen3.6 27B | "Hi" tokens | 248 | 42 | 83.1% less, 5.90x fewer |
| Qwen3.6 27B | Puzzle tokens | 40,101 | 7,376 | 81.6% less, 5.44x fewer |
| Qwen3.6 27B | Puzzle time | 13m36s | 2m27s | 82.0% faster, 5.55x speedup |
| Qwen3.6 27B | Bench score | 4620 | 4620 | same score |
| Qwen3.6 27B | Bench time | 29m54s | 22m20s | 25.3% faster, 1.34x speedup |
| Qwen3.6 27B | Bench throughput | 1067 t/s | 1193 t/s | +11.8%, +126 t/s |
| Qwen3.6 35B-A3B | "Hi" tokens | 200 | 12 | 94.0% less, 16.67x fewer |
| Qwen3.6 35B-A3B | Puzzle tokens | 30,096 | 2,592 | 91.4% less, 11.61x fewer |
| Qwen3.6 35B-A3B | Puzzle time | 2m32s | 12s | 92.1% faster, 12.67x speedup |
| Qwen3.6 35B-A3B | Bench score | 4620 | 4740 | +2.6%, +120 score |
| Qwen3.6 35B-A3B | Bench time | 33m52s | 11m04s | 67.3% faster, 3.06x speedup |
| Qwen3.6 35B-A3B | Bench throughput | 1844 t/s | 2195 t/s | +19.0%, +351 t/s |
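The percentage and multiplier columns can be recomputed from the raw counts; a quick sanity check (my own helper, not part of the original post):

```python
# Recompute "X% less, Yx fewer" figures from the raw before/after counts.
def delta(before, after):
    pct_less = round(100 * (1 - after / before), 1)   # percent reduction
    factor = round(before / after, 2)                 # "Yx fewer" multiplier
    return pct_less, factor

print(delta(248, 42))       # 27B "Hi" tokens        -> (83.1, 5.9)
print(delta(40101, 7376))   # 27B puzzle tokens      -> (81.6, 5.44)
print(delta(30096, 2592))   # 35B-A3B puzzle tokens  -> (91.4, 11.61)
```

These match the table's reported deltas.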

Total score and finish time are the key metrics for the chart; accuracy-per-memory is just a personal reference.

Qwen3.6 35B-A3B moves from 6th to 1st as chart leader, with a massive time reduction and a score bump.

Qwen3.6 27B moved from 4th to 3rd thanks to its better finishing time; its score is unchanged.

Total throughput recorded throughout benchmark

Qwen3.6 35B-A3B APEX I-Balanced: 1844 -> 2195 t/s

Qwen3.6 27B Uncensored HauHauCS Aggressive Q6_K_P: 1067 -> 1193 t/s

The Rust/Next.js bench is script-injected sequentially via OpenCode and runs against a production repo for financial applications, so it can't be shared publicly.

Puzzle Prompt

It's worth noting that 35B-A3B struggled immensely with this puzzle. It would occasionally loop towards the end of the CoT or give incorrect answers. But since a run took 12s instead of over 2 minutes, it was easy to retry until I got the correct answer.

You are given a constrained planning problem. Think carefully, verify each condition, and do not skip impossibility checks.

Problem: A courier starts at point S and must visit exactly once each of the locations A, B, C, D, and E, then end at T.

Travel times (in minutes) are symmetric:
S-A 4, S-B 6, S-C 8, S-D 7, S-E 9
A-B 5, A-C 7, A-D 3, A-E 8
B-C 4, B-D 6, B-E 5
C-D 5, C-E 3
D-E 6
A-T 8, B-T 6, C-T 5, D-T 7, E-T 4

Constraints:
1. C cannot be visited before B.
2. D must be visited immediately after A.
3. E cannot be the last location before T.
4. Total travel time must be less than 28 minutes.
5. Exactly one of these must be true:
   - B is visited second
   - C is visited fourth
6. If A is visited first, then B must be visited third.
7. The route must include at least one step whose travel time is exactly 3 minutes.

Task: Determine whether a valid route exists.
- If it exists, provide one valid route and its total time.
- If it does not exist, prove why no valid route can satisfy all constraints.
- Show your reasoning clearly and check every constraint explicitly.
- Do not guess. If multiple routes seem possible, test them against all rules before concluding.

Output format:
1. Conclusion: VALID ROUTE EXISTS / NO VALID ROUTE EXISTS
2. Route: ...
3. Total time: ...
4. Constraint check: ...
5. Brief proof: ...

The answer should be NO VALID ROUTE EXISTS. The models churn through this one.
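The impossibility claim is easy to verify by brute force — only 120 visit orders exist. A small sketch (my own verifier, not from the post) that checks every order against all seven constraints:

```python
from itertools import permutations

# Symmetric travel times from the puzzle, keyed by unordered endpoint pair.
times = {}
for a, b, t in [
    ("S","A",4), ("S","B",6), ("S","C",8), ("S","D",7), ("S","E",9),
    ("A","B",5), ("A","C",7), ("A","D",3), ("A","E",8),
    ("B","C",4), ("B","D",6), ("B","E",5),
    ("C","D",5), ("C","E",3), ("D","E",6),
    ("A","T",8), ("B","T",6), ("C","T",5), ("D","T",7), ("E","T",4),
]:
    times[frozenset((a, b))] = t

def valid(order):
    """Check one visit order (positions 1..5) against all seven constraints."""
    pos = {loc: i + 1 for i, loc in enumerate(order)}
    if pos["C"] < pos["B"]:                     # 1: C not before B
        return False
    if pos["D"] != pos["A"] + 1:                # 2: D immediately after A
        return False
    if order[-1] == "E":                        # 3: E not last before T
        return False
    route = ["S", *order, "T"]
    legs = [times[frozenset((route[i], route[i + 1]))] for i in range(6)]
    if sum(legs) >= 28:                         # 4: total time < 28 minutes
        return False
    if (pos["B"] == 2) == (pos["C"] == 4):      # 5: exactly one of the two
        return False
    if pos["A"] == 1 and pos["B"] != 3:         # 6: A first => B third
        return False
    if 3 not in legs:                           # 7: some leg takes exactly 3 min
        return False
    return True

solutions = [o for o in permutations("ABCDE") if valid(o)]
print("VALID ROUTE EXISTS" if solutions else "NO VALID ROUTE EXISTS")
```

Running this finds no satisfying order, agreeing with the post's expected answer. (Only two orders even survive constraints 1-3 and 5-7 — E,B,A,D,C and E,B,C,A,D — and both exceed the 28-minute budget, at 32 and 35 minutes.)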

GBNF Grammar

root ::= think out

think ::= "<think>\n" "Q=" q "\n" "M=" m "\n" "K=" toks "\n" "R=" toks "\n" "V=" v "\n" "</think>\n\n"

q ::= "solve" | "prove" | "route" | "debug" | "patch" | "code" | "calc" | "compare" | "explain"
m ::= "case" | "enum" | "check" | "derive" | "edit" | "test" | "trace" | "rank"
v ::= "ok" | "fail" | "done" | "blocked" | "candidate" | "verify"
toks ::= tok | tok "," tok | tok "," tok "," tok | tok "," tok "," tok "," tok | tok "," tok "," tok "," tok "," tok
tok ::= [A-Za-z][A-Za-z0-9_.!<>=/-]{0,18}
out ::= [\x09\x0A\x0D\x20-\x7E]+

The only quirk I've noticed is occasional thinking tags appearing outside the CoT in Open WebUI.

Outside of that, it works on Hermes, llama.cpp's WebUI and OpenCode without issue.

Since I didn't have more time to spend on my prod setup - it's past my bedtime - I hope this gives your setup a boost too.

submitted by /u/Holiday_Purpose_3166