Hi folks,

Enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general-purpose use; they also solve puzzles correctly more often. The initial intent was to optimise the 35B-A3B reasoning traces, since it's the most efficient model on my 5090 setup: I can run parallel jobs with llama.cpp on my prod. I love the 27B's consistency, but its prefill churn on long-horizon work is painful.

I tweaked the GBNF and ran a basic prompt through my custom Rust/Next.js bench to check for improvements, and I have to say 35B-A3B had the nicest uplift. The full test set: a simple "Hi" prompt, a puzzle, and my custom Rust/Next.js bench (60-task suite). Ironically, I included the "Hi" prompt because the community rightfully complained about reasoning drag on simple things with the 35B-A3B.

Tested Specs

RTX 5090 on Fedora 43, llama.cpp mainline (April 24), with the GGUF builds named below.

Total Score + Finish Time are the keys for the chart; accuracy per memory is personal reference. Qwen3.6 35B-A3B moves from X6 -> X1 as chart leader, with a massive time reduction and a score bump. Qwen3.6 27B moves from X4 -> X3 on better finishing time; its score holds.

Total throughput recorded throughout the benchmark:
- Qwen3.6 35B-A3B APEX I-Balanced: 1844 -> 2195 t/s
- Qwen3.6 27B Uncensored HauHauCS Aggressive Q6_K_P: 1067 -> 1193 t/s

The Rust/Next.js bench is script-injected sequentially with OpenCode and runs against a prod repo for financial applications, so it's not publicly shared.

Puzzle Prompt

It's worth noting that 35B-A3B struggled immensely with this puzzle: it would occasionally loop towards the end of the CoT or give incorrect answers. Since a run took me 12 s instead of 2+ min, it was easy to retry until I got correct answers. The answer should be NO VALID ROUTE EXISTS. The models really churn through this one.

GBNF Grammar

I've only noticed some thinking tags leaking outside the CoT on Open WebUI. Beyond that, it works on Hermes, llama.cpp's WebUI, and OpenCode without issue. I didn't have more time to spend on my prod (past sleep time), so I hope this gives your setup a boost.
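The grammar file itself isn't shared in the post, so below is a minimal sketch of the kind of GBNF constraint it describes, assuming the goal is to bound the <think> block so reasoning traces can't run away. The rule names and the length cap are illustrative assumptions, not the author's actual grammar, and the bounded-repetition syntax ({m,n}) requires a reasonably recent llama.cpp:

```gbnf
# Hypothetical sketch, not the author's grammar: allow exactly one
# <think> block, cap its length, then require a plain answer.
root   ::= "<think>" think "</think>" "\n" answer

# Cap the reasoning trace; excluding "<" also keeps stray tags out
# of the CoT. The 4096-char bound is an illustrative number, not a
# tested value.
think  ::= [^<]{0,4096}

# The visible answer: any run of characters that never opens a tag.
answer ::= [^<]*
```

A cap this blunt trades accuracy for speed on hard prompts, which fits the post's observation that retrying the fast 35B-A3B beats waiting on long traces.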
GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B
Reddit r/LocalLLaMA / 4/28/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- A user reports an optimization (“tweaked the GBNF”) that speeds up inference for Qwen3.6 35B-A3B and Qwen3.6 27B while also improving correctness on puzzles; a grammar-loading sketch follows this list.
- In testing with an RTX 5090 on Fedora 43, using llama.cpp mainline (April 24) and specific GGUF builds, the 27B model saw large reductions in output tokens and puzzle latency when the grammar was applied.
- The “Hi” prompt was used intentionally to reflect community feedback about reasoning drag on simple prompts, and results are framed around reducing that overhead.
- Bench results indicate that the 27B model’s coding/general performance score held steady (“score maintains”) while generation time improved, suggesting the grammar change mainly affects efficiency.
- The author’s main motivation was to optimize 35B-A3B reasoning traces for faster parallel workloads in llama.cpp; the 35B-A3B uplift turned out better than expected.
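For readers who want to reproduce this kind of test, here is a hedged sketch of loading a grammar file through the llama-cpp-python bindings (the post drives llama.cpp directly and via OpenCode instead; the model path and grammar filename below are placeholders, not the author's files):

```python
from llama_cpp import Llama, LlamaGrammar

# Placeholder paths: substitute your own GGUF build and grammar file.
llm = Llama(model_path="qwen3.6-35b-a3b-apex.gguf", n_ctx=8192)
grammar = LlamaGrammar.from_file("think-cap.gbnf")

# The "Hi" prompt from the post: a cheap probe for reasoning drag.
out = llm(
    "Hi",
    grammar=grammar,  # constrain decoding to the GBNF rules
    max_tokens=512,
)
print(out["choices"][0]["text"])
```

The same grammar file can also be passed to llama.cpp's CLI via --grammar-file for quick one-off checks.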
Related Articles
- Black Hat USA (AI Business)
- China’s DeepSeek prices new V4 AI model at 97% below OpenAI’s GPT-5.5 (SCMP Tech)
- I built Dispatch AI. I just wanted to share it. If you find it cool, take a look and leave a comment. (Dev.to)
- Replit AI Agent: Practical Guide for Dev Workflows (Dev.to)
- Open source Xiaomi MiMo-V2.5 and V2.5-Pro are among the most efficient (and affordable) at agentic 'claw' tasks (VentureBeat)