Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F, 64 GB RAM) with an RTX 4060 Ti 16GB, trying to get unsloth's Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to share what's working for me.
models.ini entry:
```ini
[qwen3.5-35b-64k]
model = Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
c = 65536
t = 6
tb = 8
n-cpu-moe = 11
b = 1024
ub = 512
parallel = 2
kv-unified = true
```

Router start command:
```
llama-server.exe --models-preset models.ini --models-max 1 --host 0.0.0.0 --webui-mcp-proxy --port 8080
```

What I'm seeing now
With that preset, I’m reliably getting roughly 40–60 tok/s on many tasks, even with Docker Desktop running in the background.
A few examples from the logs:
- ~56.41 tok/s on a 1050-token generation
- ~46.84 tok/s on a 234-token continuation after a 1087-token prompt
- ~44.97 tok/s on a 259-token continuation after checkpoint restore
- ~41.21 tok/s on a 1676-token generation
- ~42.71 tok/s on a 1689-token generation in a much longer conversation
So not “benchmark fantasy numbers,” but real usable throughput at 64k on a 4060 Ti 16GB.
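If you want to track numbers like these over many runs instead of eyeballing the logs, a small parser helps. This is a hedged sketch: the regex assumes the common "… ms / N tokens (… ms per token, X tokens per second)" timing-line shape that llama.cpp builds print, which can vary between versions.

```python
import re

# Matches both "prompt eval time" and "eval time" lines from llama-server
# timing output, e.g.:
#   eval time =  4995.73 ms /   234 tokens (  21.35 ms per token,  46.84 tokens per second)
TIMING_RE = re.compile(
    r"eval time\s*=\s*(?P<ms>[\d.]+)\s*ms\s*/\s*(?P<tokens>\d+)\s*(?:tokens|runs)"
    r"\s*\(\s*[\d.]+\s*ms per token,\s*(?P<tps>[\d.]+)\s*tokens per second\)"
)

def parse_timings(log_text: str) -> list[dict]:
    """Return one dict per timing line: token count and tokens/sec."""
    out = []
    for m in TIMING_RE.finditer(log_text):
        out.append({
            "tokens": int(m.group("tokens")),
            "tok_per_s": float(m.group("tps")),
        })
    return out
```

Feed it the server's stdout (or a saved log file) and you can average tokens/sec across a whole session instead of cherry-picking single generations.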
Other observations
- The startup logs can look “correct” and still produce bad throughput if the effective runtime shape isn’t what you think.
- Looking at the effective values in the startup log (n_parallel, kv_unified, n_ctx_seq, n_ctx_slot, n_batch, n_ubatch) was way more useful than just staring at the top-level command line.
- Keeping VRAM pressure under control mattered more than squeezing out the absolute highest one-off score.
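A quick back-of-the-envelope KV-cache estimate makes the VRAM-pressure point concrete. This is a sketch of the standard formula only; the model dimensions in the example are placeholders, not the real Qwen3.5-35B-A3B numbers, so substitute the values your llama-server startup log reports.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # K and V each store n_ctx * n_kv_heads * head_dim elements per layer;
    # bytes_per_elem = 2 corresponds to an f16 cache.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Example with made-up dimensions at 64k context:
example = kv_cache_bytes(n_ctx=65536, n_layers=48, n_kv_heads=4, head_dim=128)
gib = example / 2**30  # 6.0 GiB with these placeholder numbers
```

Even modest-sounding per-layer sizes multiply out to several GiB at 64k, which is why quantized KV caches and careful n-cpu-moe offload settings matter more than chasing a one-off benchmark score.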
I didn't find a database of tuned configs for various cards, but one might be something useful to have.
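For what it's worth, such a database could start as something very simple. The schema below is invented for illustration (no such project exists as far as I know): a flat JSON mapping of GPU + model to the same keys used in the models.ini preset above.

```python
import json

# Hypothetical tuned-config store, keyed by "<gpu>/<model>".
presets = {
    "rtx-4060-ti-16gb/qwen3.5-35b-64k": {
        "c": 65536, "t": 6, "tb": 8, "n-cpu-moe": 11,
        "b": 1024, "ub": 512, "parallel": 2, "kv-unified": True,
    }
}

def lookup(db: dict, gpu: str, model: str):
    """Return the stored preset for a GPU/model pair, or None."""
    return db.get(f"{gpu}/{model}")

# The whole store round-trips through a single JSON file.
serialized = json.dumps(presets, indent=2)
```

People could then contribute entries per card/model pair instead of everyone rediscovering the same n-cpu-moe and batch values.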