Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s

Reddit r/LocalLLaMA / 4/16/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author reports successfully tuning llama.cpp on Windows 11 to run Qwen3.5-35B (GGUF, Q4_K_L) at 64k context on an RTX 4060 Ti 16GB, achieving about 40–60 tokens per second in real use.
  • A specific `models.ini` preset (including `c=65536`, `t=6`, `tb=8`, and MoE/router-related settings like `n-cpu-moe=11`) and a `llama-server.exe` launch command are provided as the working configuration.
  • Logged examples show throughput staying in the ~41–56 tok/s range across different prompt/generation sizes (e.g., ~1050-token generation and longer multi-turn conversations).
  • The post emphasizes that startup logs may look correct even when performance is poor, and that deeper runtime parameters (e.g., `n_parallel`, `kv_unified`, context slot/batch settings) are more informative than top-level command-line assumptions.
  • The author concludes that managing VRAM pressure is more important than maximizing peak benchmark scores, and suggests a potential need for a community database of tuned configs per GPU.

Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F, 64GB RAM) with an RTX 4060 Ti 16GB, trying to get the unsloth Qwen3.5-35B-A3B-UD-Q4_K_L quant running well at 64k context. I finally got it into a pretty solid place, so I wanted to share what's working for me.

models.ini entry:

[qwen3.5-35b-64k]
model = Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
c = 65536
t = 6
tb = 8
n-cpu-moe = 11
b = 1024
ub = 512
parallel = 2
kv-unified = true

Router start command:

llama-server.exe --models-preset models.ini --models-max 1 --host 0.0.0.0 --webui-mcp-proxy --port 8080 
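Once the server is up, a quick sanity check from another terminal looks something like the following. llama-server exposes a /health endpoint and an OpenAI-compatible API; the model name "qwen3.5-35b-64k" here is an assumption that it matches the preset section name, which may differ in your setup.

```shell
# Sanity check after launch; assumes the server above is listening on port 8080.
curl -s http://localhost:8080/health

# OpenAI-compatible chat endpoint; adjust the model name to your preset key.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-35b-64k","messages":[{"role":"user","content":"hi"}]}'
```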

What I’m seeing now

With that preset, I’m reliably getting roughly 40–60 tok/s on many tasks, even with Docker Desktop running in the background.

A few examples from the logs:

  • ~56.41 tok/s on a 1050-token generation
  • ~46.84 tok/s on a 234-token continuation after a 1087-token prompt
  • ~44.97 tok/s on a 259-token continuation after checkpoint restore
  • ~41.21 tok/s on a 1676-token generation
  • ~42.71 tok/s on a 1689-token generation in a much longer conversation

So not “benchmark fantasy numbers,” but real usable throughput at 64k on a 4060 Ti 16GB.

Other observations

  • The startup logs can look “correct” and still produce bad throughput if the effective runtime shape isn’t what you think.
  • Checking the effective runtime values (n_parallel, kv_unified, n_ctx_seq, n_ctx_slot, n_batch, n_ubatch) was way more useful than just staring at the top-level command line.
  • Keeping VRAM pressure under control mattered more than squeezing out the absolute highest one-off score.
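One way to do that check without eyeballing the whole startup spew: capture the server's log output to a file and grep it for those fields. The log lines written below are made-up stand-ins (the exact wording varies by llama.cpp build), so treat this as a sketch; the grep pattern is the useful part.

```shell
# Stand-in for a real capture, e.g.: llama-server.exe ... 2> server.log
cat > server.log <<'EOF'
llama_context: n_ctx_seq  = 65536
llama_context: n_batch    = 1024
llama_context: n_ubatch   = 512
llama_context: kv_unified = true
srv          init: initializing slots, n_slots = 2
EOF

# Pull out only the runtime-shape fields worth verifying.
grep -E "n_parallel|kv_unified|n_ctx_seq|n_ctx_slot|n_batch|n_ubatch" server.log
```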

I did not find a database of tuned configs for various cards, but it might be a useful thing for the community to build.

submitted by /u/Nutty_Praline404