Hey everyone,
I've seen a couple of benchmarks recently and thought this one might be interesting to some of you as well.
I'm GPU poor (8 GB VRAM) but still need 'large' context windows from time to time when working with local LLMs to process sensitive data/code/information. The 35B-A3B model of the new generation of Qwen models has proven particularly attractive in this regard. Surprisingly, my gaming laptop with 8 GB of VRAM and 64 GB of RAM achieves about 26 t/s at a 100k context size.
Machine & Config:
- Lenovo gaming laptop (Windows)
- GPU: NVIDIA GeForce RTX 4060 8 GB
- CPU: i7-14000HX
- 64 GB RAM (DDR5 5200 MT/s)
- Backend: llama.cpp (build: c5a778891 (8233))
Model: Qwen3.5-35B-A3B-UD-Q4_K_XL (Unsloth)
Benchmarks:
```
llama-bench.exe `
  -m "Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf" `
  -b 4096 -ub 1024 `
  --flash-attn 1 `
  -t 16 --cpu-mask 0x0000FFFF --cpu-strict 1 `
  --prio 3 `
  -ngl 99 -ncmoe 35 `
  -d 5000,10000,20000,50000,100000 -r 1 `
  --progress
```

| Context depth | Prompt (pp512) | Generation (tg128) |
|---|---|---|
| 5,000 | 403.28 t/s | 34.93 t/s |
| 10,000 | 391.45 t/s | 34.51 t/s |
| 20,000 | 371.26 t/s | 33.40 t/s |
| 50,000 | 353.15 t/s | 29.84 t/s |
| 100,000 | 330.69 t/s | 26.18 t/s |
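For a rough sense of what these rates mean in practice, here's a quick back-of-the-envelope calculation using the 100k row above (plain Python; the 1,000-token response length is just an assumed example, and pp512/tg128 are spot measurements at that depth, so treat the result as an estimate):

```python
# Estimate end-to-end latency for ingesting a full 100k-token prompt
# plus generating a 1,000-token response, at the measured 100k-row rates.
prompt_tokens = 100_000
gen_tokens = 1_000          # assumed response length, for illustration
pp_rate = 330.69            # prompt processing rate, t/s (100k row)
tg_rate = 26.18             # token generation rate, t/s (100k row)

prompt_s = prompt_tokens / pp_rate    # time to ingest the prompt
gen_s = gen_tokens / tg_rate          # time to generate the reply
total_min = (prompt_s + gen_s) / 60

print(f"prompt: {prompt_s:.0f} s, generation: {gen_s:.0f} s, "
      f"total: {total_min:.1f} min")
# → prompt: 302 s, generation: 38 s, total: 5.7 min
```

So even at the full 100k depth, a one-shot long-document query finishes in under six minutes on this laptop.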
I'm currently considering upgrading my system. My original idea was to get a Strix Halo machine with 128 GB, but compared to my current setup it seems I would only be able to run higher quants of the same models at slightly improved speed (see recent benchmarks on Strix Halo), not larger models. So I'm considering an RX 7900 XTX instead. Any thoughts would be highly appreciated!