
Benchmarking Qwen3.5-35B-3AB on 8 GB VRAM gaming laptop: 26 t/s at 100k context window

Reddit r/LocalLLaMA / 3/18/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The benchmark shows Qwen3.5-35B-A3B-UD-Q4_K_XL runs on an 8 GB VRAM gaming laptop (RTX 4060) with 64 GB RAM using llama.cpp, achieving about 26 t/s generation with a 100k context window.
  • The results include context-depth dependent throughput, with 5k context yielding ~403.3 t/s (prompt) and ~34.9 t/s (generation), dropping to ~330.7 t/s (prompt) and ~26.2 t/s (generation) at 100k context.
  • The measurement details specify hardware and software: Lenovo gaming laptop, Windows, RTX 4060 8 GB, i7-14000HX, 64 GB RAM, llama.cpp (build: c5a778891), and the model Qwen3.5-35B-A3B-UD-Q4_K_XL (Unsloth).
  • The author discusses upgrade considerations, noting that a Strix Halo 128 GB may mainly allow higher quants of the same models rather than enabling larger ones, and is weighing an RX 7900 XTX instead; they welcome input on these choices.

Hey everyone,

I've seen a couple of benchmarks recently and thought this one may be interesting to some of you as well.

I'm GPU poor (8 GB VRAM) but still need 'large' context windows from time to time when working with local LLMs to process sensitive data/code/information. The new-generation Qwen 35B-A3B model has proven particularly attractive in this regard. Surprisingly, my gaming laptop with 8 GB of VRAM and 64 GB RAM achieves about 26 t/s at a 100k context size.

Machine & Config:

  • Lenovo gaming laptop (Windows)
  • GPU: NVIDIA GeForce RTX 4060 8 GB
  • CPU: i7-14000HX
  • 64 GB RAM (DDR5 5200 MT/s)
  • Backend: llama.cpp (build: c5a778891 (8233))

Model: Qwen3.5-35B-A3B-UD-Q4_K_XL (Unsloth)

Benchmarks:

llama-bench.exe `
  -m "Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf" `
  -b 4096 -ub 1024 `
  --flash-attn 1 `
  -t 16 --cpu-mask 0x0000FFFF --cpu-strict 1 `
  --prio 3 `
  -ngl 99 -ncmoe 35 `
  -d 5000,10000,20000,50000,100000 -r 1 `
  --progress
Context depth   Prompt (pp512)   Generation (tg128)
5,000           403.28 t/s       34.93 t/s
10,000          391.45 t/s       34.51 t/s
20,000          371.26 t/s       33.40 t/s
50,000          353.15 t/s       29.84 t/s
100,000         330.69 t/s       26.18 t/s

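To get a feel for what these figures mean end to end, wall time per request can be estimated directly from the measured throughput. This is a rough back-of-the-envelope sketch based on the table above, not part of the original benchmark (the 1,000-token output length is an assumed example):

```python
def request_time_s(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Estimate wall time for one request from measured throughput (t/s)."""
    prefill = prompt_tokens / pp_tps      # time to process the prompt
    generation = output_tokens / tg_tps   # time to produce the output
    return prefill, generation, prefill + generation

# Measured figures at 100k context depth (from the table above),
# assuming a hypothetical 1,000-token reply
prefill, gen, total = request_time_s(100_000, 1_000, 330.69, 26.18)
print(f"prefill ~{prefill:.0f} s, generation ~{gen:.0f} s, total ~{total / 60:.1f} min")
```

So a cold 100k-token prompt costs roughly five minutes of prefill before the first output token; with prompt caching, follow-up turns pay mostly the generation cost.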
I'm currently considering upgrading my system. My idea was to get a Strix Halo 128 GB, but it seems that compared to my current setup, I would only be able to run higher quants of the same models at slightly improved speed (see: recent benchmarks on Strix Halo), but not larger models. So, I'm considering getting an RX 7900 XTX instead. Any thoughts on that would be highly appreciated!
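One way to sanity-check which models and quants fit a given memory budget is simple arithmetic: a quantized GGUF weighs roughly parameters × bits-per-weight ÷ 8, plus KV cache and runtime overhead on top. The sketch below uses approximate bits-per-weight figures (~4.5 for Q4_K-class, ~8.5 for Q8_0); these are rough rules of thumb, not exact file sizes:

```python
def gguf_size_gb(params_billion, bits_per_weight):
    """Rough quantized weight size in GB: parameters * bits / 8."""
    return params_billion * bits_per_weight / 8

# Approximate bits-per-weight: ~4.5 for Q4_K-class quants, ~8.5 for Q8_0
for params, bpw, label in [(35, 4.5, "35B @ Q4_K"),
                           (35, 8.5, "35B @ Q8_0"),
                           (120, 4.5, "120B @ Q4_K")]:
    print(f"{label}: ~{gguf_size_gb(params, bpw):.0f} GB of weights")
```

Weights alone are only part of the story, though: on unified-memory hardware like Strix Halo, memory bandwidth tends to bound generation speed, which is the tradeoff behind the quants-vs-larger-models question above.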

submitted by /u/External_Dentist1928