[Qwen3.6 35b a3b] Used the top config for my setup (8GB VRAM and 32GB RAM) and found that somehow the Q4_K_XL model from Unsloth runs slightly faster and uses fewer output tokens than Q4_K_M, despite higher memory usage

Reddit r/LocalLLaMA / 4/26/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A Reddit user tested Qwen3.6 35B (a3b) locally with an 8GB VRAM / 32GB RAM setup, using a high-performance configuration (CtxSize 131,072, GpuLayers 99, and K/V cache set to q8_0).
  • They observed that the Q4_K_XL quantized model (from Unsloth) ran slightly faster than Q4_K_M, with higher tokens/sec and lower average wall time (about 7.5% faster in their measurements).
  • The XL variant also produced fewer average output tokens (about 4.5% less), which contributed to the improved overall responsiveness despite using more memory.
  • The user suspects part of the slowdown in the first run (~33%) comes from a timing/benchmarking bug related to MoE models needing components copied from storage into RAM, even though the test was repeated five times.
  • Overall, the post suggests that for this specific hardware and workflow, the Q4_K_XL model can provide a better speed/efficiency tradeoff than Q4_K_M under the same top configuration.

Config

  • CtxSize: 131,072
  • GpuLayers: 99
  • CpuMoeLayers: 38
  • Threads: 16
  • BatchSize/UBatchSize: 4096/4096
  • CacheType K/V: q8_0
  • Tool Context: file mode (tools.kilocode.official.md)
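
The config above maps roughly onto llama.cpp's `llama-server` flags. A sketch of an equivalent invocation (the model filename is a placeholder, not from the post, and exact flag names can vary between llama.cpp versions):

```shell
# Hypothetical llama-server command mirroring the posted config.
# Model path is illustrative only; flag names assume a recent llama.cpp build.
llama-server \
  --model Qwen3.6-35B-A3B-Q4_K_XL.gguf \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --n-cpu-moe 38 \
  --threads 16 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```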
| Metric | M Model | XL Model | Difference |
| --- | --- | --- | --- |
| Avg Tokens/sec | 28.92 | 29.78 | +0.86 (+3.0%) |
| Median Tokens/sec | 30.96 | 32.08 | +1.12 (+3.6%) |
| Avg Wall Seconds | 108.03s | 99.93s | -8.10s (-7.5%) |
| Avg Output Tokens | 3,031.8 | 2,895.8 | -136 (-4.5%) |
| Avg Input Tokens/sec | 50.20 | 55.96 | +5.76 (+11.5%) |
| Avg Decode Tokens/sec | 75.89 | 76.44 | +0.55 (+0.7%) |
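
The percentage deltas in the table can be recomputed directly from the M/XL pairs; a quick check (values copied from the post, row names are my own labels):

```python
# Recompute the table's "Difference" column: delta = XL - M, pct = delta / M.
rows = {
    "Avg Tokens/sec":        (28.92, 29.78),
    "Median Tokens/sec":     (30.96, 32.08),
    "Avg Wall Seconds":      (108.03, 99.93),
    "Avg Output Tokens":     (3031.8, 2895.8),
    "Avg Input Tokens/sec":  (50.20, 55.96),
    "Avg Decode Tokens/sec": (75.89, 76.44),
}
for name, (m, xl) in rows.items():
    diff = xl - m
    pct = 100 * diff / m
    print(f"{name}: {diff:+.2f} ({pct:+.1f}%)")
```

The printed percentages match the table: e.g. the ~7.5% wall-time improvement is -8.10s on a 108.03s baseline.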

Runs ~33% slower on the first run because my code has a bug that includes the initialization time, and as you know, for an MoE model you have to load it from storage into RAM. Each test was run 5 times to try to cancel that out, but I still included the first run because that's how I would realistically use it (turning it on, using it once, turning it off to run something else, etc.).
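
The cold-start effect the poster describes can be isolated by reporting averages both with and without the first run. A minimal sketch of such a harness (not the poster's actual code; the fake workload below just simulates a slow first call):

```python
import time

def bench(fn, runs=5):
    """Time fn over several runs and return (avg including the cold
    first run, avg of warm runs only). The gap between the two captures
    one-time costs such as paging MoE weights from storage into RAM."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    cold_avg = sum(times) / len(times)
    warm_avg = sum(times[1:]) / (len(times) - 1)
    return cold_avg, warm_avg

# Simulated workload: the first call is deliberately slower (cold start).
calls = {"n": 0}
def fake_inference():
    calls["n"] += 1
    time.sleep(0.05 if calls["n"] == 1 else 0.01)

cold, warm = bench(fake_inference)
print(f"avg incl. cold start: {cold:.3f}s, warm-only avg: {warm:.3f}s")
```

Whether to report the cold-inclusive or warm-only number is a judgment call; the poster deliberately kept the cold run in because it reflects their real start-use-stop workflow.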

submitted by /u/EggDroppedSoup