Using a Radeon 9060 XT 16 GB, the Gemma4 24B A4B IQ4 NL model achieves 25.9 t/s

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A Reddit user reports successfully running the Gemma4 24B A4B IQ4 NL local LLM at about 25.9 tokens per second on a mini PC with an AMD 7840HS plus an eGPU Radeon 9060 XT (16GB VRAM).
  • They describe using llama.cpp/llama-server with long-context “fit” settings (fit on, fit-ctx 128000, fit-target 256) and various inference parameters, noting the model becomes usable for querying their codebase through OpenCode at this performance level.
  • They find that raising the batch-size parameters (-b and -ub) any further causes the model to fail to load, implying tight VRAM/memory constraints.
  • The post asks the community whether there are unnecessary llama.cpp arguments or opportunities to optimize the command for better stability and efficiency on the given hardware.

I'm testing local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060 XT with 16 GB VRAM). Because I'm not very familiar with llama.cpp, I kept getting unsatisfactory results, but with the recent Gemma4 24B A4B IQ4 NL model I finally reached 25.9 t/s. I even connected it to OpenCode and asked questions about my codebase, and it seems usable at this level.

llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_NL --fit on --fit-ctx 128000 --fit-target 256 -np 1 -fa on --no-mmap --mlock --threads 8 -b 512 -ub 256 -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget -1 
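
For anyone else unfamiliar with these flags, here is the same invocation split across lines, with comments giving my reading of each group (the --fit* options are copied as-is from above and, as far as I can tell, auto-budget the context length against the 16 GB of VRAM):

# -hf pulls the UD-IQ4_NL quant from Hugging Face on first run.
# --fit on / --fit-ctx / --fit-target: long-context "fit" settings as used above.
# -np 1: a single parallel slot; -fa on: flash attention.
# --no-mmap --mlock: read the weights fully and pin them in RAM.
# -b 512 -ub 256: logical / physical batch sizes (the VRAM-sensitive knobs).
# -ctk/-ctv q8_0: 8-bit quantized KV cache to save memory at long context.
# The remaining flags are sampling settings plus an unlimited reasoning budget.
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_NL \
  --fit on --fit-ctx 128000 --fit-target 256 \
  -np 1 -fa on --no-mmap --mlock --threads 8 \
  -b 512 -ub 256 -ctk q8_0 -ctv q8_0 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget -1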

This is the result I get with this setup.
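
Since OpenCode just talks to the server's OpenAI-compatible API (llama-server listens on port 8080 by default), a quick sanity check of the endpoint looks roughly like this, with the prompt as an arbitrary placeholder:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Summarize what this repo does."}],"temperature":0.6,"max_tokens":256}'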

If I increase -b or -ub any further, it won't even load. Are any of these arguments unnecessary, or are there arguments that could be optimized?
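
One thing I might try to find the largest batch sizes that still fit is a quick sweep with llama-bench (a sketch only; it assumes llama-bench's comma-separated parameter lists and uses a placeholder path to the locally cached GGUF):

MODEL=~/models/gemma-4-24B-A4B-it-UD-IQ4_NL.gguf   # placeholder path to the downloaded quant
llama-bench -m "$MODEL" -t 8 -fa 1 -ctk q8_0 -ctv q8_0 \
  -b 256,512,1024 -ub 128,256,512 \
  -p 512 -n 128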

Thanks.

submitted by /u/CrowKing63