Qwen3.5-9B.Q4_K_M on RTX 3070 Mobile (8GB) with ik_llama.cpp — optimization findings + ~50 t/s gen speed, looking for tips

Reddit r/LocalLLaMA / 3/22/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The post documents tuning local AI inference on a laptop (Acer Predator Helios 315-53) using ik_llama.cpp to run Qwen3.5-9B Q4_K_M, achieving ~47.8 t/s generation and ~82 t/s prompt evaluation with VRAM around 97% in the initial naive config.
  • It identifies misapplied MoE flags on a non-MoE model, a silent failure of --mlock that requires system limits, and a batch size that consumed nearly 2 GB of VRAM, which were corrected for better performance.
  • The optimized configurations show notable gains: fixed flags with b2048/ub512 and q8_0K/q4_0V yield ~48.4 t/s gen and ~189.9 t/s prompt eval (VRAM ~80%), while q8_0K / q8_0V achieves ~50.0 t/s gen and ~213.0 t/s prompt eval (VRAM ~84%).
  • The post offers practical tips for local model inference on limited GPUs (adjusting batch size, enabling memory limits, avoiding MoE flags when not needed) and invites others to share results on similar hardware.

Disclosure: This post was partly written with the help of Claude Opus 4.6, which helped gather the info and make it understandable for myself first and foremost... and for this post too!

Hi!

Been tuning local inference on my laptop and wanted to share some findings, because some of it surprised me. Would also love to hear what others are getting on similar hardware.

My setup:

  • Laptop: Acer Predator Helios 315-53
  • CPU: Intel i7-10750H (6 cores / 12 threads)
  • GPU: RTX 3070 Mobile, 8GB VRAM (effectively ~7.7GB usable)
  • RAM: 32GB
  • OS: CachyOS (Arch-based, Linux 6.19)
  • Engine: ik_llama.cpp — ikawrakow's fork of llama.cpp with a lot of extra optimizations
  • Model: Qwen3.5-9B Q4_K_M (Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF)

Starting config (naive):

bash

./build/bin/llama-server \
  -m ./models/Qwen3.5-9B.Q4_K_M.gguf \
  -ngl 999 \
  --n-cpu-moe 36 \
  -fa on \
  -c 65536 \
  -b 4096 \
  -ub 2048 \
  -ctk q4_0 \
  -ctv q4_0 \
  --threads 6 \
  --threads-batch 12 \
  --mlock \
  -ger \
  -ser 0,1

Results: ~47.8 t/s gen, ~82 t/s prompt eval. VRAM at ~97%.

What was wrong:

1. MoE flags on a non-MoE model. --n-cpu-moe, -ger, and -ser are all MoE-specific. The model metadata clearly shows n_expert = 0, so these flags do nothing, or actively hurt. Dropped all three... I don't even know why I tried them, to be honest.

2. --mlock was silently failing. The log shows failed to mlock 1417465856-byte buffer: Cannot allocate memory. It was doing nothing. You need ulimit -l unlimited (as root) or a limits.conf entry for this to work.
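For reference, here's one way to raise the memlock limit on a systemd-based distro like CachyOS. This is a sketch; "youruser" is a placeholder, and the exact paths may differ on your system:

```shell
# Check the current locked-memory limit (often a small default like 8192 KiB)
ulimit -l

# Raise it permanently via /etc/security/limits.conf (then log out and back in):
#   youruser  soft  memlock  unlimited
#   youruser  hard  memlock  unlimited

# Or, if llama-server runs as a systemd service, add to its unit file:
#   [Service]
#   LimitMEMLOCK=infinity

# After re-login, confirm before starting the server:
ulimit -l   # should print "unlimited"
```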

3. Batch size eating VRAM. -b 4096 was causing a 2004 MiB compute buffer — that's nearly 2GB just for batching, on an 8GB card. For a single-user local server you don't need that. Dropping to -b 2048 -ub 512 cut it to 501 MiB.

Optimized configs and results:

Config                                     Gen (t/s)   Prompt eval (t/s)   VRAM used
Original (q4_0/q4_0, b4096)                47.8        82.6                ~97%
Fixed flags + b2048/ub512, q8_0K/q4_0V     48.4        189.9               ~80%
q8_0K / q8_0V                              50.0        213.0               ~84%

The prompt eval speedup from ~82 → ~213 t/s is huge — mostly from fixing the batch size and letting the GPU actually breathe.

Gen speed barely changed across KV configs (~2% difference between q4_0 and q8_0 values), but quality did: the model generated noticeably more coherent and complete responses with q8_0/q8_0, especially on longer outputs. Worth the extra ~256 MiB.
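If you want to estimate the KV-cache cost yourself: GGUF's q4_0 stores 32 values in 18 bytes and q8_0 stores 32 values in 34 bytes, so the size is easy to approximate. A rough sketch — the layer/head dimensions below are made-up placeholders, not Qwen3.5-9B's actual config (hybrid-SSM models keep far fewer attention layers than pure transformers, which is why the q4_0→q8_0 delta can be only a few hundred MiB):

```shell
# Approximate KV-cache size per quant type.
# GGUF block sizes: q4_0 = 18 bytes per 32 values, q8_0 = 34 bytes per 32 values.
# Dimensions below are illustrative placeholders, NOT the real model config.
awk 'BEGIN {
  attn_layers = 8      # attention layers (hypothetical; hybrid-SSM uses only a subset)
  kv_heads    = 4      # KV heads after GQA (hypothetical)
  head_dim    = 128    # hypothetical
  ctx         = 65536  # matches -c 65536
  elems = 2 * attn_layers * kv_heads * head_dim * ctx   # K and V tensors
  printf "q4_0 KV cache: %d MiB\n", elems * 18 / 32 / 1048576
  printf "q8_0 KV cache: %d MiB\n", elems * 34 / 32 / 1048576
}'
# q4_0 KV cache: 288 MiB
# q8_0 KV cache: 544 MiB
```

Swap in the real n_layer / n_head_kv / head_dim values from your model's GGUF metadata to get actual numbers.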

Prompt:
Implement a working Rust program that finds all prime numbers up to N using the Sieve of Eratosthenes. Then explain step by step how the algorithm works, analyze its time and space complexity, and show example output for N=50. Make the code well-commented.

Final command:

bash

./build/bin/llama-server \
  -m ./models/Qwen3.5-9B.Q4_K_M.gguf \
  -ngl 999 \
  -fa on \
  -c 65536 \
  -b 2048 \
  -ub 512 \
  -ctk q8_0 \
  -ctv q8_0 \
  --threads 6 \
  --threads-batch 12

Things I haven't tried yet / questions:

  • GPU power limit tuning — on mobile GPUs you can often drop the TGP significantly with minimal gen-speed loss, since inference is memory-bandwidth-bound, not compute-bound. Haven't benchmarked this yet.
  • Other models at this size that work well on 8GB Mobile? Especially anything with good coding or reasoning performance.
  • Anyone else running ik_llama.cpp instead of mainline? The extra ik-specific optimizations (fused ops, graph reuse, etc.) seem genuinely worthwhile.
  • Any tips for the hybrid SSM architecture specifically? The ctx_shift warning is a bit annoying — if you fill context it hard stops, no sliding window.
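On the power-limit idea above, the standard approach is nvidia-smi, though note that many mobile GPUs don't expose a settable power limit at all (it errors out as unsupported), and any change resets on reboot. A sketch with a placeholder wattage:

```shell
# Inspect current, default, and max power limits (values are GPU-specific):
nvidia-smi -q -d POWER

# Enable persistence mode, then set a lower limit in watts (requires root).
# 80 W is a placeholder; stay within the min/max range reported above.
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl 80
```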

Happy to share more logs if useful. What are others getting on similar 8GB mobile hardware?

submitted by /u/Expensive_Demand1069