Using a Radeon 9060 XT 16 GB, the Gemma4 24B A4B IQ4 NL model achieves 25.9 t/s

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A Reddit user reports successfully running the Gemma4 24B A4B IQ4 NL local LLM at about 25.9 tokens per second on a mini PC with an AMD 7840HS plus an eGPU Radeon 9060 XT (16GB VRAM).
  • They describe using llama.cpp/llama-server with long-context “fit” settings (fit on, fit-ctx 128000, fit-target 256) and various inference parameters, noting the model becomes usable for querying their codebase through OpenCode at this performance level.
  • They find that raising the batch-size parameters (-b and -ub) any further causes the model to fail to load, implying tight VRAM/memory constraints.
  • The post asks the community whether there are unnecessary llama.cpp arguments or opportunities to optimize the command for better stability and efficiency on the given hardware.

I'm testing local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060 XT with 16 GB VRAM). Because I'm not very familiar with llama.cpp, I kept getting unsatisfactory results, but with the recent Gemma4 24B A4B IQ4 NL model I finally reached 25.9 t/s. I even connected it to OpenCode and asked questions about my codebase, and it seems usable at this level.

llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_NL --fit on --fit-ctx 128000 --fit-target 256 -np 1 -fa on --no-mmap --mlock --threads 8 -b 512 -ub 256 -ctk q8_0 -ctv q8_0 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget -1 
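
For anyone else unfamiliar with these flags, here is the same invocation split across lines, with comments giving my reading of each group (the --fit* options are copied as-is from above and, as far as I can tell, auto-budget the context length against the 16 GB of VRAM):

# -hf pulls the UD-IQ4_NL quant from Hugging Face on first run.
# --fit on / --fit-ctx / --fit-target: long-context "fit" settings as used above.
# -np 1: a single parallel slot; -fa on: flash attention.
# --no-mmap --mlock: read the weights fully and pin them in RAM.
# -b 512 -ub 256: logical / physical batch sizes (the VRAM-sensitive knobs).
# -ctk/-ctv q8_0: 8-bit quantized KV cache to save memory at long context.
# The remaining flags are sampling settings plus an unlimited reasoning budget.
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-IQ4_NL \
  --fit on --fit-ctx 128000 --fit-target 256 \
  -np 1 -fa on --no-mmap --mlock --threads 8 \
  -b 512 -ub 256 -ctk q8_0 -ctv q8_0 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget -1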

This is the result I get with this setup.
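
Since OpenCode just talks to the server's OpenAI-compatible API (llama-server listens on port 8080 by default), a quick sanity check of the endpoint looks roughly like this, with the prompt as an arbitrary placeholder:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Summarize what this repo does."}],"temperature":0.6,"max_tokens":256}'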

If I increase -b or -ub any further, it won't even load. Are any of these arguments unnecessary, or are there arguments that could be optimized?
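
One thing I might try to find the largest batch sizes that still fit is a quick sweep with llama-bench (a sketch only; it assumes llama-bench's comma-separated parameter lists and uses a placeholder path to the locally cached GGUF):

MODEL=~/models/gemma-4-24B-A4B-it-UD-IQ4_NL.gguf   # placeholder path to the downloaded quant
llama-bench -m "$MODEL" -t 8 -fa 1 -ctk q8_0 -ctv q8_0 \
  -b 256,512,1024 -ub 128,256,512 \
  -p 512 -n 128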

Thanks.

submitted by /u/CrowKing63