[Qwen3.6 35b a3b] Used the top config for my setup 8gb vram and 32gb ram, and found that somehow the Q4_K_XL model from Unsloth runs just slightly faster and used less tokens for output compared to Q4_K_M despite more memory usage

Reddit r/LocalLLaMA / 4/26/2026

Key Points

A Reddit user tested Qwen3.6 35B (a3b) locally with an 8GB VRAM / 32GB RAM setup, using a high-performance configuration (CtxSize 131,072, GpuLayers 99, and K/V cache set to q8_0).
They observed that Unsloth's Q4_K_XL quantization ran slightly faster than Q4_K_M: average throughput was about 3.0% higher (29.78 vs. 28.92 tokens/sec) and average wall time about 7.5% lower (99.93 s vs. 108.03 s).
The XL variant also produced about 4.5% fewer output tokens on average (2,895.8 vs. 3,031.8), which accounts for part of the shorter wall time, even though it uses more memory.
The first run is about 33% slower because, by the user's own account, a bug in their benchmark code counts initialization time, during which the MoE model is loaded from storage into RAM. Each test was run five times to dilute this, and the cold run was left in deliberately because it matches how the user actually works (start the model, use it once, shut it down).
Overall, the post suggests that for this specific hardware and workflow, the Q4_K_XL model can provide a better speed/efficiency tradeoff than Q4_K_M under the same top configuration.

Config

CtxSize: 131,072
GpuLayers: 99
CpuMoeLayers: 38
Threads: 16
BatchSize/UBatchSize: 4096/4096
CacheType K/V: q8_0
Tool Context: file mode (tools.kilocode.official.md)
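
For readers who want to try a similar setup, the settings above can be mapped onto command-line arguments. The sketch below assumes the runner is llama.cpp's `llama-server`, which the post does not state. The flag mapping, the `build_server_args` helper, and the `model.gguf` path are illustrative assumptions, not the poster's actual launch command.

```python
# Assumption: the post never names its runner. The settings are mapped onto
# llama.cpp's llama-server flags purely for illustration.
CONFIG = {
    "CtxSize": 131072,
    "GpuLayers": 99,
    "CpuMoeLayers": 38,
    "Threads": 16,
    "BatchSize": 4096,
    "UBatchSize": 4096,
    "CacheTypeK": "q8_0",
    "CacheTypeV": "q8_0",
}

# Hypothetical mapping from the post's setting names to llama-server flags.
FLAG_FOR_SETTING = {
    "CtxSize": "--ctx-size",
    "GpuLayers": "--n-gpu-layers",
    "CpuMoeLayers": "--n-cpu-moe",
    "Threads": "--threads",
    "BatchSize": "--batch-size",
    "UBatchSize": "--ubatch-size",
    "CacheTypeK": "--cache-type-k",
    "CacheTypeV": "--cache-type-v",
}


def build_server_args(model_path, config):
    """Return an argv list for llama-server built from the given settings."""
    args = ["llama-server", "--model", model_path]
    for setting, flag in FLAG_FOR_SETTING.items():
        args += [flag, str(config[setting])]
    return args


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a file named in the post.
    print(" ".join(build_server_args("model.gguf", CONFIG)))
```

Building the argument list from a dictionary makes it easy to swap only the model file between the two quantizations while keeping every other setting identical, which is what a fair comparison needs.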

Metric	Q4_K_M	Q4_K_XL	Difference (XL vs. M)
Avg Tokens/sec	28.92	29.78	+0.86 (+3.0%)
Median Tokens/sec	30.96	32.08	+1.12 (+3.6%)
Avg Wall Seconds	108.03s	99.93s	-8.10s (-7.5%)
Avg Output Tokens	3,031.8	2,895.8	-136 (-4.5%)
Avg Input Tokens/sec	50.20	55.96	+5.76 (+11.5%)
Avg Decode Tokens/sec	75.89	76.44	+0.55 (+0.7%)
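
The Difference column can be recomputed from the two averages in each row. The short sketch below does that, then checks how much of the wall-time gap follows from fewer output tokens and higher throughput combined. The figures are copied from the table; the `pct_change` helper and the approximation that wall time is roughly output tokens divided by tokens/sec are assumptions made here, not something the post states.

```python
# Values copied from the table: (Q4_K_M, Q4_K_XL).
TABLE = {
    "avg_tokens_per_sec": (28.92, 29.78),
    "median_tokens_per_sec": (30.96, 32.08),
    "avg_wall_seconds": (108.03, 99.93),
    "avg_output_tokens": (3031.8, 2895.8),
    "avg_input_tokens_per_sec": (50.20, 55.96),
    "avg_decode_tokens_per_sec": (75.89, 76.44),
}


def pct_change(baseline, new):
    """Percentage change of `new` relative to `baseline`, to one decimal."""
    return round((new - baseline) / baseline * 100, 1)


def predicted_wall_change(table):
    """Estimate the wall-time change from token count and throughput alone.

    Assumes wall time is roughly output tokens / tokens per second. A ratio
    of averages is not the same as an average of per-run ratios, so this is
    only a rough consistency check.
    """
    tokens_m, tokens_xl = table["avg_output_tokens"]
    speed_m, speed_xl = table["avg_tokens_per_sec"]
    ratio = (tokens_xl / speed_xl) / (tokens_m / speed_m)
    return round((ratio - 1) * 100, 1)


if __name__ == "__main__":
    for metric, (m_value, xl_value) in TABLE.items():
        print(f"{metric}: {pct_change(m_value, xl_value):+.1f}%")
    print(f"predicted wall-time change: {predicted_wall_change(TABLE):+.1f}%")
```

Under that approximation, the estimate lands close to the measured wall-time difference. That suggests the shorter wall time is mostly explained by the XL model emitting fewer tokens at a slightly higher rate, not by a large change in raw decode speed.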

The first run is ~33% slower because my code has a bug that includes the initialization time, and as you know, for an MoE model you have to load it from storage into RAM. Each test is run 5 times to try to cancel it out, but I still included the first run because that's how I would realistically use it (turning it on, using it once, turning it off to run something else, etc.).
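
One way to keep initialization out of the per-run numbers is to time the model load separately from each generation. The sketch below is a minimal illustration of that idea, not the poster's benchmark code: `load_model` and `generate` are hypothetical callables standing in for whatever the real harness does. It also shows how much one cold run inflates a five-run average if "33% slower" is read as 33% more wall time, which is one possible reading of the post.

```python
import time


def benchmark(load_model, generate, n_runs=5):
    """Time model loading once, then time each generation run separately."""
    start = time.perf_counter()
    model = load_model()
    load_seconds = time.perf_counter() - start

    run_seconds = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate(model)
        run_seconds.append(time.perf_counter() - start)
    return load_seconds, run_seconds


def cold_start_inflation(n_runs, cold_penalty):
    """Mean wall time relative to a warm run when only run 1 is slower.

    `cold_penalty` is the fractional extra time of the first run,
    e.g. 0.33 for a run that takes 33% longer.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    return (1.0 + cold_penalty + (n_runs - 1)) / n_runs


if __name__ == "__main__":
    # Dummy callables so the sketch runs without a real model.
    load_s, runs = benchmark(lambda: object(), lambda model: None)
    print(f"load: {load_s:.6f}s, runs: {len(runs)}")
    print(f"5-run mean vs. warm run: {cold_start_inflation(5, 0.33):.3f}x")
```

With five runs and one run taking 33% longer, the average comes out about 6.6% above the warm-run time, so repetition dilutes the cold start but does not remove it. If the weights are memory-mapped, some loading cost may still show up in the first generation even when the load call is timed separately.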

submitted by /u/EggDroppedSoup