In the recent kv rotation PR it was found that the existing q8 kv quants tank performance on AIME25, but can be recovered mostly with rotation

Reddit r/LocalLLaMA / 3/30/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A recent “KV rotation” PR in llama.cpp found that the existing Q8 KV-cache quantization significantly reduces scores on the AIME25 benchmark.
  • Most of the lost accuracy can be recovered by applying the KV rotation technique on top of Q8 KV quantization, rather than relying on the prior Q8 setup alone.
  • The discussion suggests clear value for users already running with a Q8-quantized KV cache, though at least one commenter plans to stay on FP16 for now.
  • The findings mainly concern local LLM inference quality and benchmark outcomes when using a quantized KV cache, and may warrant revisiting existing quantization settings.
  • Overall, the thread highlights a technique that improves the practical tradeoff between memory efficiency (Q8 KV cache) and accuracy (benchmark performance).
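For readers unfamiliar with why rotation helps: symmetric Q8 quantization ties its scale to the largest value in the tensor, so a few outlier channels inflate the quantization step for every other channel. Multiplying by an orthogonal matrix (e.g., a Hadamard rotation) before quantizing spreads that outlier energy evenly across channels, and the inverse rotation restores the original basis afterward. Below is a minimal NumPy sketch of that general idea only; the quantizer, the Sylvester Hadamard construction, and the synthetic data are illustrative assumptions, not llama.cpp's actual implementation:

```python
import numpy as np

def quantize_q8(x):
    """Symmetric per-tensor int8 quantize/dequantize round trip."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H @ H.T == I

rng = np.random.default_rng(0)
d = 128
k = rng.standard_normal(d)
k[::16] *= 40.0  # inject outlier channels, which dominate the int8 scale

H = hadamard(d)

# Quantize directly: the outliers force a coarse quantization step.
err_plain = np.abs(quantize_q8(k) - k).mean()

# Rotate, quantize, rotate back: the rotation spreads outlier energy
# across all channels, so the quantization step is much finer.
err_rot = np.abs(H.T @ quantize_q8(H @ k) - k).mean()

print(err_rot < err_plain)
```

On this synthetic outlier-heavy vector the rotated round trip has a noticeably smaller mean error, which mirrors the accuracy recovery the PR reports for rotated Q8 KV caches.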

The comment: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357

I think this could be great for existing q8 users. Personally I'll be sticking with fp16 for the foreseeable future.

submitted by /u/Betadoggo_