
Benchmarked all unsloth Qwen3.5-35B-A3B Q4 models on a 3090

Reddit r/LocalLLaMA / 3/11/2026


Key Points

  • The article presents benchmark results for several Qwen3.5-35B-A3B Q4-Q3 quantized language models tested on an RTX 3090 GPU with a 10K context length.
  • Performance metrics reported include file size, prompt evaluation speed (tokens per second), generation speed, and perplexity scores for each model variant.
  • The fastest model tested, UD-Q4_K_M, has since been deleted by its creator, unsloth, and is no longer available; UD-Q4_K_L is noted as a potential replacement.
  • Benchmarks aim to help users understand trade-offs between model size, speed, and quality in the Q4-Q3 quantized versions.
  • The test excludes models smaller than Q3_K_S to focus on larger, more capable variants for practical evaluation on a high-end consumer GPU.

Qwen3.5-35B-A3B Q4-Q3 Model Benchmarks (RTX 3090)

Another day, another useless or maybe not that useless table with numbers.

This time I benchmarked Qwen3.5-35B-A3B in the Q4-Q3 range with a context of 10K. I omitted everything smaller in file size than the Q3_K_S in this test.
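
The exact commands behind the table below aren't spelled out, but numbers of this kind come from llama.cpp's llama-bench and llama-perplexity tools. Something along these lines should produce comparable figures; the flags mirror the llama-bench runs further down, and the perplexity test file is just a placeholder, not the corpus actually used here:

```
# Speed at 10K context (prompt eval + generation), one model file at a time
./llama-bench -m "./Qwen3.5-35B-A3B-Q3_K_S.gguf" -ngl 99 -d 10000 -fa 1

# Perplexity for the same file; the test corpus (e.g. wikitext-2's wiki.test.raw) is a placeholder
./llama-perplexity -m "./Qwen3.5-35B-A3B-Q3_K_S.gguf" -f wiki.test.raw -ngl 99
```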

Results:

| Model | File Size | Prompt Eval (t/s) | Generation (t/s) | Perplexity (PPL) |
|------------|----------|-------------------|------------------|-------------------|
| Q3_K_S | 15266 MB | 2371.78 ± 12.27 | 117.12 ± 0.38 | 6.7653 ± 0.04332 |
| Q3_K_M | 16357 MB | 2401.14 ± 9.51 | 120.23 ± 0.84 | 6.6829 ± 0.04268 |
| UD-Q3_K_XL | 16602 MB | 2394.04 ± 10.50 | 119.17 ± 0.17 | 6.6920 ± 0.04277 |
| UD-IQ4_XS | 17487 MB | 2348.84 ± 19.65 | 117.76 ± 0.90 | 6.6294 ± 0.04226 |
| UD-IQ4_NL | 17822 MB | 2355.98 ± 14.76 | 120.28 ± 0.58 | 6.6299 ± 0.04226 |
| UD-Q4_K_M | 19855 MB | 2354.98 ± 13.63 | 132.27 ± 0.59 | 6.6059 ± 0.04208 |
| UD-Q4_K_L | 20206 MB | 2364.87 ± 13.44 | 127.64 ± 0.48 | 6.5889 ± 0.04204 |
| Q4_K_S | 20674 MB | 2355.96 ± 14.75 | 121.23 ± 0.60 | 6.5888 ± 0.04200 |
| Q4_K_M | 22017 MB | 2343.71 ± 9.35 | 121.00 ± 0.90 | 6.5593 ± 0.04173 |
| UD-Q4_K_XL | 22242 MB | 2335.45 ± 10.18 | 119.38 ± 0.84 | 6.5523 ± 0.04169 |

Notes

The fastest model in this list, UD-Q4_K_M, is not available anymore; it was deleted by unsloth. It looks like the UD-Q4_K_L can more or less replace it.
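
If you want to grab the suggested replacement, huggingface-cli is the usual route; the repo and file names below are a guess at unsloth's naming scheme, not a verified link, so check the actual repo first:

```
# Repo/file names are assumptions based on unsloth's usual naming; verify on Hugging Face before running
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-UD-Q4_K_L.gguf --local-dir ./
```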

Edit: Since a lot of people (including me) seem to be unsure whether they should run the 27B or the 35B-A3B, I made one more benchmark run.

I chose two similarly sized models, one of each, and kept increasing the context until one of them failed. Qwen3.5-27B was the one that hit its limit, at a context length of 120K.

```
./llama-bench -m "./Qwen3.5-27B-Q4_K_M.gguf" -ngl 99 -d 120000 -fa 1
./llama-bench -m "./Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf" -ngl 99 -d 120000 -fa 1
```

| Model | File Size | VRAM Used | Prompt Eval (t/s) | Generation (t/s) |
|----------------------------|-----------|-----------------|-------------------|------------------|
| Qwen3.5-27B-Q4_K_M | 15.58 GiB | 23.794 GiB / 24 | 509.27 ± 8.73 | 29.30 ± 0.01 |
| Qwen3.5-35B-A3B-UD-Q3_K_XL | 15.45 GiB | 18.683 GiB / 24 | 1407.86 ± 5.49 | 93.95 ± 0.11 |

So at the same context length I get roughly 3x the speed out of the 35B-A3B without any CPU offloading (1407.86 vs. 509.27 t/s prompt eval, 93.95 vs. 29.30 t/s generation).

What's interesting is that I was even able to specify the full context length for the 35B-A3B, with flash attention turned on, without my GPU having to offload anything, using llama-bench (maybe fit is turned on automatically? It does not feel right, at least):

```
./llama-bench -m "./Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf" -ngl 99 -d 262144 -fa 1
```

| Model | File Size | VRAM Used | Prompt Eval (t/s) | Generation (t/s) |
|----------------------------|-----------|-----------------|-------------------|------------------|
| Qwen3.5-35B-A3B-UD-Q3_K_XL | 15.45 GiB | 21.697 GiB / 24 | 854.13 ± 2.47 | 70.96 ± 0.19 |

Even at full context length, the tg of the 35B-A3B is still ~2.5x faster than the 27B at a context length of 120K.
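
If you want to double-check that the full 262144 context really fits outside of llama-bench, one option (not part of the runs above) is to load the same file in llama-server with the same settings and watch VRAM in nvidia-smi:

```
# Same quant, all layers on the GPU, full 262144 context; check nvidia-smi for actual VRAM use.
# The flash attention flag syntax (-fa 1 / -fa on / --flash-attn) depends on your llama.cpp version.
./llama-server -m "./Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf" -ngl 99 -c 262144 -fa on
```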

submitted by /u/StrikeOner