I finally found the best 5070 TI + 32GB ram GGUF model

Reddit r/LocalLLaMA / 4/8/2026

💬 OpinionSignals & Early TrendsTools & Practical UsageModels & Research

共有:

Key Points

A Reddit user reports that the Gemma 4 26B A3B “IQ4 NL” GGUF model (“gemma-4-26B-A4B-it-UD-IQ4_NL.gguf”) works very well on a 5070 Ti with 32GB RAM as a local co-assistant.
They provide a specific llama.cpp launch command and note it largely follows Google’s recommended settings, achieving strong responsiveness (under ~100 tokens/sec) and large context behavior.
The model scores about 6.5/10 in the user’s tests, successfully reading and following a local guide.md and handling file reading and related tasks.
The main weakness identified is difficulty with package/integrated code structure—specifically connecting files accurately and handling package-level intricacies.
Overall, the user claims it passed their “carwash test” among few candidates, making it a practical choice for VS Code workflows alongside Claude Code.

it's the Gemma 4 26B A3B IQ4 NL.

My llama.cpp command is:

llama-server.exe -m "gemma-4-26B-A4B-it-UD-IQ4_NL.gguf" -ngl 999 -fa on -c 65536 -ctk q8_0 -ctv q8_0 --batch-size 1024 --ubatch-size 512 --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --no-warmup --port 8080 --host 0.0.0.0 --chat-template-kwargs "{\"enable_thinking\":true}" --perf

In essence, this is just the recommended setting's from Google, but this has served me damn well as a co-assistant to Claude Code in VS Code.

I gave it tests, and it's around 6.5/10. It reads my guide.md, it follows it, reads files, and many more. Its main issue is that it can't get past the intricacies of packages. What I mean by that is that it can't connect files to each other with full accuracy.

But that's it for its issues. Everything else has been great since it has a large context size and fast <100 tokens per second. This is one of the few models that have passed the carwash test from my testing.

submitted by /u/FrozenFishEnjoyer
[link] [comments]