
Benchmarked all unsloth Qwen3.5-35B-A3B Q4 models on a 3090

Reddit r/LocalLLaMA / 3/11/2026


Key Points

  • The article presents benchmark results for several Qwen3.5-35B-A3B Q4-Q3 quantized language models tested on an RTX 3090 GPU with a 10K context length.
  • Performance metrics reported include file size, prompt evaluation speed (tokens per second), generation speed, and perplexity scores for each model variant.
  • The fastest model tested, UD-Q4_K_M, has since been deleted by its creator, unsloth, and is no longer available; UD-Q4_K_L is noted as a potential replacement.
  • Benchmarks aim to help users understand trade-offs between model size, speed, and quality in the Q4-Q3 quantized versions.
  • The test excludes models smaller than Q3_K_S to focus on larger, more capable variants for practical evaluation on a high-end consumer GPU.

Qwen3.5-35B-A3B Q4-Q3 Model Benchmarks (RTX 3090)

Another day, another useless or maybe not that useless table with numbers.

This time I benchmarked Qwen3.5-35B-A3B in the Q4-Q3 range with a context of 10K. I omitted everything smaller in file size than the Q3_K_S in this test.
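
The exact commands behind the table below aren't spelled out, but numbers of this kind come from llama.cpp's llama-bench and llama-perplexity tools. Something along these lines should produce comparable figures; the flags mirror the llama-bench runs further down, and the perplexity test file is just a placeholder, not the corpus actually used here:

```
# Speed at 10K context (prompt eval + generation), one model file at a time
./llama-bench -m "./Qwen3.5-35B-A3B-Q3_K_S.gguf" -ngl 99 -d 10000 -fa 1

# Perplexity for the same file; the test corpus (e.g. wikitext-2's wiki.test.raw) is a placeholder
./llama-perplexity -m "./Qwen3.5-35B-A3B-Q3_K_S.gguf" -f wiki.test.raw -ngl 99
```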

Results:

| Model | File Size | Prompt Eval (t/s) | Generation (t/s) | Perplexity (PPL) |
|------------|----------|-------------------|------------------|-------------------|
| Q3_K_S | 15266 MB | 2371.78 ± 12.27 | 117.12 ± 0.38 | 6.7653 ± 0.04332 |
| Q3_K_M | 16357 MB | 2401.14 ± 9.51 | 120.23 ± 0.84 | 6.6829 ± 0.04268 |
| UD-Q3_K_XL | 16602 MB | 2394.04 ± 10.50 | 119.17 ± 0.17 | 6.6920 ± 0.04277 |
| UD-IQ4_XS | 17487 MB | 2348.84 ± 19.65 | 117.76 ± 0.90 | 6.6294 ± 0.04226 |
| UD-IQ4_NL | 17822 MB | 2355.98 ± 14.76 | 120.28 ± 0.58 | 6.6299 ± 0.04226 |
| UD-Q4_K_M | 19855 MB | 2354.98 ± 13.63 | 132.27 ± 0.59 | 6.6059 ± 0.04208 |
| UD-Q4_K_L | 20206 MB | 2364.87 ± 13.44 | 127.64 ± 0.48 | 6.5889 ± 0.04204 |
| Q4_K_S | 20674 MB | 2355.96 ± 14.75 | 121.23 ± 0.60 | 6.5888 ± 0.04200 |
| Q4_K_M | 22017 MB | 2343.71 ± 9.35 | 121.00 ± 0.90 | 6.5593 ± 0.04173 |
| UD-Q4_K_XL | 22242 MB | 2335.45 ± 10.18 | 119.38 ± 0.84 | 6.5523 ± 0.04169 |

Notes

The fastest model in this list, UD-Q4_K_M, is not available anymore; it was deleted by unsloth. It looks like the UD-Q4_K_L can more or less replace it.
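
If you want to grab the suggested replacement, huggingface-cli is the usual route; the repo and file names below are a guess at unsloth's naming scheme, not a verified link, so check the actual repo first:

```
# Repo/file names are assumptions based on unsloth's usual naming; verify on Hugging Face before running
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-UD-Q4_K_L.gguf --local-dir ./
```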

Edit: Since a lot of people (including me) seem to be unsure whether they should run the 27B or the 35B-A3B, I made one more benchmark run.

I chose two similarly sized models, one of each, and kept increasing the context until one of them failed. Qwen3.5-27B was the one that hit its limit, at a context length of 120K.

```
./llama-bench -m "./Qwen3.5-27B-Q4_K_M.gguf" -ngl 99 -d 120000 -fa 1
./llama-bench -m "./Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf" -ngl 99 -d 120000 -fa 1
```

| Model | File Size | VRAM Used | Prompt Eval (t/s) | Generation (t/s) |
|----------------------------|-----------|-----------------|-------------------|------------------|
| Qwen3.5-27B-Q4_K_M | 15.58 GiB | 23.794 GiB / 24 | 509.27 ± 8.73 | 29.30 ± 0.01 |
| Qwen3.5-35B-A3B-UD-Q3_K_XL | 15.45 GiB | 18.683 GiB / 24 | 1407.86 ± 5.49 | 93.95 ± 0.11 |

So at the same context length I get roughly 3x the speed out of the 35B-A3B without any CPU offloading (1407.86 vs. 509.27 t/s prompt eval, 93.95 vs. 29.30 t/s generation).

What's interesting is that I was even able to specify the full context length for the 35B-A3B, with flash attention turned on, without my GPU having to offload anything, using llama-bench (maybe fit is turned on automatically? It does not feel right, at least):

```
./llama-bench -m "./Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf" -ngl 99 -d 262144 -fa 1
```

| Model | File Size | VRAM Used | Prompt Eval (t/s) | Generation (t/s) |
|----------------------------|-----------|-----------------|-------------------|------------------|
| Qwen3.5-35B-A3B-UD-Q3_K_XL | 15.45 GiB | 21.697 GiB / 24 | 854.13 ± 2.47 | 70.96 ± 0.19 |

Even at full context length, the tg of the 35B-A3B is still ~2.5x faster than the 27B at a context length of 120K.
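
If you want to double-check that the full 262144 context really fits outside of llama-bench, one option (not part of the runs above) is to load the same file in llama-server with the same settings and watch VRAM in nvidia-smi:

```
# Same quant, all layers on the GPU, full 262144 context; check nvidia-smi for actual VRAM use.
# The flash attention flag syntax (-fa 1 / -fa on / --flash-attn) depends on your llama.cpp version.
./llama-server -m "./Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf" -ngl 99 -c 262144 -fa on
```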

submitted by /u/StrikeOner