how do you decide between q4 and q5 on a 70b when 24gb is the cap?

Reddit r/LocalLLaMA / 5/26/2026



Key Points

The author is struggling with choosing between Q4 and Q5 quantization for a 70B model on a GPU capped at 24GB, where Q4 fits comfortably but Q5 only works by effectively sacrificing other GPU usage.
Their main task is code generation on a private codebase. Online benchmarks show a gap of roughly 1–2 points on HumanEval between the two quantizations, which may or may not justify running the system closer to the “redline.”
The post questions how practitioners make the day-to-day decision between Q4 and Q5 under similar hardware constraints, noting they keep changing their approach every few weeks.
The author suspects they may be overthinking the trade-off and is considering using a coin flip as a practical decision method.

ran into the q4 vs q5 wall again this morning. 70b model. 24gb card. q4 fits with margin, q5 fits if i kill everything else on the gpu and pray.

did the math on actual quality difference for my use case (mostly code generation on a private codebase). benchmarks online give me a 1-2 point delta on humaneval. that's not nothing but it's also not enough to tell me whether the q5 squeeze is worth running everything closer to the redline.
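
For a rough sense of scale on that 1-2 point delta, here is a minimal sketch of the arithmetic. HumanEval has 164 problems, so a point or two is only a handful of problems. The 0.70 pass rate is a placeholder assumption, not a figure from this post, and the function names are made up for illustration. It treats each problem as an independent pass/fail trial, which is a simplification.

```python
import math

HUMANEVAL_PROBLEMS = 164  # size of the public HumanEval benchmark


def pass_rate_stderr(pass_rate: float, n_problems: int) -> float:
    """Binomial standard error of a single pass rate (0..1 scale)."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)


def delta_in_problems(delta_points: float, n_problems: int) -> float:
    """Convert a delta in percentage points into a count of problems."""
    return delta_points / 100.0 * n_problems


def problems_needed(pass_rate: float, delta_points: float, z: float = 1.96) -> int:
    """Problems needed before the delta exceeds z standard errors of the
    difference between two independent runs with the same pass rate."""
    delta = delta_points / 100.0
    return math.ceil(2.0 * pass_rate * (1.0 - pass_rate) * (z / delta) ** 2)


if __name__ == "__main__":
    assumed_rate = 0.70  # ASSUMPTION: placeholder, not a measured score
    se = pass_rate_stderr(assumed_rate, HUMANEVAL_PROBLEMS)
    print(f"stderr of one run: {100 * se:.1f} points")
    for points in (1, 2):
        n = delta_in_problems(points, HUMANEVAL_PROBLEMS)
        print(f"{points}-point delta = about {n:.1f} problems")
    print("problems needed for a 2-point delta:", problems_needed(assumed_rate, 2))
```

Under that assumed pass rate, the sampling error of a single HumanEval run is larger than the 1-2 point gap itself, so the published delta alone can't settle it either way. Running both quants on the same set of prompts from the private codebase and comparing them problem by problem would likely be a tighter test than two independent benchmark scores.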

how do people running larger models day to day actually decide between q4 and q5 on this kind of setup? i keep flip-flopping every couple of weeks and at this point i'm pretty sure i'm just overthinking it. probably going to flip a coin tomorrow.

submitted by /u/Practical_Low29