Hey everyone,
I've seen a couple of benchmarks recently and thought this one might be interesting to some of you as well.
I'm GPU poor (8 GB VRAM) but still need 'large' context windows from time to time when working with local LLMs to process sensitive data/code/information. The 35B-A3B model of the new generation of Qwen models has proven particularly attractive in this regard. Surprisingly, my gaming laptop with 8 GB of VRAM and 64 GB of RAM achieves about 26 t/s at a 100k context size.
Machine & Config:
- Lenovo gaming laptop (Windows)
- GPU: NVIDIA GeForce RTX 4060 8 GB
- CPU: i7-14000HX
- 64 GB RAM (DDR5 5200 MT/s)
- Backend: llama.cpp (build: c5a778891 (8233))
Model: Qwen3.5-35B-A3B-UD-Q4_K_XL (Unsloth)
Benchmarks:
```
llama-bench.exe `
  -m "Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf" `
  -b 4096 -ub 1024 `
  --flash-attn 1 `
  -t 16 --cpu-mask 0x0000FFFF --cpu-strict 1 `
  --prio 3 `
  -ngl 99 -ncmoe 35 `
  -d 5000,10000,20000,50000,100000 -r 1 `
  --progress
```

| Context depth | Prompt (pp512) | Generation (tg128) |
|---|---|---|
| 5,000 | 403.28 t/s | 34.93 t/s |
| 10,000 | 391.45 t/s | 34.51 t/s |
| 20,000 | 371.26 t/s | 33.40 t/s |
| 50,000 | 353.15 t/s | 29.84 t/s |
| 100,000 | 330.69 t/s | 26.18 t/s |
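For a rough sense of what these rates mean in practice, here's a quick back-of-the-envelope calculation using the 100k row above (plain Python; the 1,000-token response length is just an assumed example, and pp512/tg128 are spot measurements at that depth, so treat the result as an estimate):

```python
# Estimate end-to-end latency for ingesting a full 100k-token prompt
# plus generating a 1,000-token response, at the measured 100k-row rates.
prompt_tokens = 100_000
gen_tokens = 1_000          # assumed response length, for illustration
pp_rate = 330.69            # prompt processing rate, t/s (100k row)
tg_rate = 26.18             # token generation rate, t/s (100k row)

prompt_s = prompt_tokens / pp_rate    # time to ingest the prompt
gen_s = gen_tokens / tg_rate          # time to generate the reply
total_min = (prompt_s + gen_s) / 60

print(f"prompt: {prompt_s:.0f} s, generation: {gen_s:.0f} s, "
      f"total: {total_min:.1f} min")
# → prompt: 302 s, generation: 38 s, total: 5.7 min
```

So even at the full 100k depth, a one-shot long-document query finishes in under six minutes on this laptop.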
I'm currently considering upgrading my system. My original idea was to get a Strix Halo machine with 128 GB, but compared to my current setup it seems I would only be able to run higher quants of the same models at slightly improved speed (see recent benchmarks on Strix Halo), not larger models. So I'm considering an RX 7900 XTX instead. Any thoughts would be highly appreciated!