Q8 Cache

Reddit r/LocalLLaMA / 4/14/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The post discusses whether improved cache quantization quality makes Q8 cache a generally good choice for local LLM inference.
  • It specifically asks about using Q8 cache for a 26B Gemma4 model, implying a need to balance quality and performance.
  • The discussion links to a llama.cpp pull request, suggesting the question is tied to recent changes in the project’s caching/quantization behavior.
  • The main takeaway is a practical decision question for practitioners choosing cache quantization settings that balance runtime cost against output quality.

https://github.com/ggml-org/llama.cpp/pull/21038

Now that cache quantization has better quality, does that mean a Q8 cache is a good choice? For example, for 26B Gemma4?

submitted by /u/Longjumping_Bee_6825
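
A minimal sketch of what trying this looks like in practice, assuming the llama.cpp C API: the type_k / type_v fields of llama_context_params and the GGML_TYPE_Q8_0 constant come from the upstream headers, while the model path and context size are placeholders, and some of the function names shown (e.g. llama_load_model_from_file, llama_new_context_with_model) have been renamed in newer builds, so check the llama.h that ships with your version.

```cpp
// Sketch: requesting a Q8_0-quantized KV cache through the llama.cpp C API.
// Model path and n_ctx are placeholders; verify function names against your llama.h.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    // Load the model with default parameters (hypothetical GGUF path).
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("gemma-26b.Q4_K_M.gguf", mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // Ask for an 8-bit KV cache instead of the default f16.
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx  = 8192;
    cparams.type_k = GGML_TYPE_Q8_0;  // quantize the K cache
    cparams.type_v = GGML_TYPE_Q8_0;  // quantize the V cache (typically also needs flash attention)

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to create context\n");
        return 1;
    }

    // ... tokenize, decode, and sample as usual ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

On the command-line side, the same setting is usually exposed through the --cache-type-k / --cache-type-v (-ctk / -ctv) flags of llama-server and llama-cli, e.g. -ctk q8_0 -ctv q8_0.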