SWE-bench results for different KV cache quantization levels

Reddit r/LocalLLaMA / 3/24/2026


Key Points

  • A researcher ran SWE-bench-lite across multiple KV cache quantization levels and reports early results showing no visible performance difference between f16 and q8, with other quantization levels appearing noisy as well.
  • They observed high run-to-run variability and plan to repeat benchmarks across a broader model set to produce more concrete conclusions.
  • The project (Quantuzo) provides a public dashboard, repo, and results dataset with reproducible Docker Compose setup, versioned SWE-agent metadata, and stored logs/trajectories.
  • The author raises a potential benchmarking concern that SWE-bench–specific model training could influence evaluation outcomes, and they invite suggestions and alternative approaches.
  • They position the work as a practical reference for the community seeking KV-cache/VRAM-efficient ways to increase effective context window, and they offer support avenues like compute donations or running benchmarks on others’ hardware.

I have been running SWE-bench-lite across different KV cache quantization levels. I am still collecting data but I can share the early results.

Dashboard: https://huggingface.co/spaces/burakaydinofficial/Quantuzo

Repo: https://github.com/burakaydinofficial/Quantuzo

Results Dataset: https://huggingface.co/datasets/burakaydinofficial/Quantuzo

My early observation is that there is no visible difference between f16 and q8. Results at the other quantization levels also look like noise: random variation between runs. We will see more concrete results once all the benchmarks have been repeated across the model set.
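For context on why single runs can look like noise: SWE-bench-lite has only 300 instances, so the sampling error on a resolve rate is sizable even before run-to-run variability. A quick back-of-the-envelope sketch (the 90/300 numbers below are made up for illustration, not from the dashboard):

```python
import math

def resolve_rate_ci(resolved, total, z=1.96):
    """Normal-approximation 95% confidence interval for a resolve rate."""
    p = resolved / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

# Hypothetical run: 90 of the 300 SWE-bench-lite instances resolved.
lo, hi = resolve_rate_ci(90, 300)
print(f"{lo:.3f} .. {hi:.3f}")  # roughly 0.248 .. 0.352
```

A ~10-point-wide interval means two quantization levels whose true resolve rates differ by a few points are hard to separate in one run, which is consistent with repeating benchmarks across the model set before drawing conclusions.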

I also have another concern I have been mulling over. SWE-bench is very well structured in my opinion, but models being trained specifically for this benchmark might skew our results. It is very likely that these benchmarks appear in training sets. I will continue with SWE-bench-lite for some time, since it is still respected and reliable, but I am open to suggestions.

In its current state we have some Qwen3.5 models, GLM-4.7-Flash, and Nemotron 3 Nano; some are benchmarked across the full spectrum of KV cache quantizations, while others are included just for reference.

Everything here is reproducible. It is very straightforward to run via Docker Compose. SWE-agent is versioned and recorded in the metadata. All logs and trajectories are stored in a public Hugging Face dataset. There are pull and push scripts for fetching all results or a subset of them. The results database is, of course, also a public git repo. To push, I believe I need to grant some permissions.

I am also open to support, whether that's compute donations, cloud credits, or just running benchmarks on your own hardware. Contributors will be credited on both the dashboard and repo.

Since most of the community has limited VRAM and is looking for ways to increase the context window, this can become a good reference. All input will be appreciated.
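As a rough illustration of the VRAM stakes, here is a back-of-the-envelope KV-cache size estimate. The model shape is a hypothetical 7B-class GQA model (not one of the models on the dashboard), and the bytes-per-value figures are approximations of llama.cpp's f16/q8_0/q4_0 cache types:

```python
# Approximate bytes per cached value: q8_0 packs 32 values into 34 bytes,
# q4_0 packs 32 values into 18 bytes (block scale included).
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gib(ctx_len, n_layers, n_kv_heads, head_dim, cache_type):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return values * BYTES_PER_VALUE[cache_type] / 2**30

# Hypothetical 7B-class GQA model: 32 layers, 8 KV heads, head_dim 128.
for ct in ("f16", "q8_0", "q4_0"):
    print(ct, round(kv_cache_gib(32768, 32, 8, 128, ct), 2), "GiB")
```

Under these assumptions, a 32k context cache shrinks from about 4 GiB at f16 to roughly half that at q8_0, which is why a "q8 is indistinguishable from f16" result would be directly actionable for VRAM-constrained setups.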

submitted by /u/burakodokus