Building an LLM Quants Testing Site/Resource - Sharing a few insights from the first month, so you can share your thoughts and wishes for the future.

Reddit r/LocalLLaMA / 5/5/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author is building an LLM quantization testing resource to clarify how quantization quality impacts open-weight models on real-world work tasks.
  • They note that new models quickly generate many quantized variants, but there is a transparency gap about which quantizations are truly “good enough.”
  • After about one month of benchmarking, they report running roughly 10 tests per day and testing 268 quantization variants, with plans to add 50–100 more per week depending on GPU and efficiency.
  • Early example results focus on “vision reasoning” quantizations for models such as Qwen 3.5 35B (A3B), Gemma 4 26B (A4B), and Qwen 3.6 35B (A3B), including differences in token efficiency.
  • The project is motivated by the possibility that rising AI costs could make open-weight LLM understanding and evaluation more important for everyday users.

Wanted to share some insights into a project I am building. The focus is to make it easier to understand how quantization affects open-weight models on practical work tasks. For every new model released, it seems like 200+ quantizations instantly appear within the first couple of days. This is actually great, but I feel like we have a transparency gap around what is "good enough" when choosing an LLM quantization.

On the back of the realization that "mainstream" AI might actually increase in cost, open-weight LLMs could become relevant for the average person much sooner than we might think. If AI costs explode, understanding open-weight AI becomes much more important to support. So that is sort of the outset.

I have been working on a benchmarking test suite focused on quantization quality and capability drop-off on practical test cases. The benchmarking has been running for about a month at approximately 10 test runs per day. I started out slow, to see if anything was breaking, while still building and optimizing a few things here and there. So far I have tested 268 quants in this first month. The intent is to keep adding quantization tests as capacity allows; I expect to add about 50-100 new quantization test runs per week. Model efficiency plays a huge role in how fast I can cover additional quantizations, as does my own GPU availability.

E.g., quant test results for Vision Reasoning across 79 quantizations of:

Qwen 3.5 35B A3B vs. Gemma 4 26B A4B IT vs. Qwen 3.6 35B A3B

https://preview.redd.it/5ykdj36ah4zg1.png?width=956&format=png&auto=webp&s=466481e0d34503cfffa721065ec69eab8e17a9e0

Further: average efficiency (token usage) results for the three models

https://preview.redd.it/4rcb8m85o4zg1.png?width=953&format=png&auto=webp&s=ae82030177c5573ed9869fb5dfa8a51ca41eeae8

Qwen 3.6 35B A3B generally uses way more tokens than the other two - without delivering better results.

Takeaway: An AI model that "works" with fewer tokens could essentially be leveraged to run multiple passes over the same task to deliver even better results. AI model efficiency is a huge deal to dive into.
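To make that concrete, here is a minimal sketch of the multi-pass idea - run the same task several times within the token budget a cheaper model frees up, then keep the majority answer. The `ask_model()` helper is hypothetical; wire it to whatever inference setup you use:

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    """Hypothetical helper: send the prompt to a model once and return its answer string."""
    raise NotImplementedError  # point this at your own inference endpoint

def majority_vote(prompt: str, passes: int = 5) -> str:
    """Run the same task several times and keep the most common answer.
    A token-efficient model can afford more passes within the same budget."""
    answers = [ask_model(prompt) for _ in range(passes)]
    return Counter(answers).most_common(1)[0][0]
```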

----

So far the following models have been tested:

qwen3.5-35b-a3b (22 quantizations tested)

gemma4-26b-a4b-it (24 quantizations tested)

qwen3.6-27b (14 quantizations tested)

qwen3.6-35b-a3b (33 quantizations tested)

qwen3.5-2b (26 quantizations tested)

qwen3.5-4b (26 quantizations tested)

qwen3.5-27b (24 quantizations tested)

gemma-4-e2b-it (24 quantizations tested)

gemma4-e4b-it (24 quantizations tested)

qwen3.5-0.8b (29 quantizations tested)

qwen3.5-9b (22 quantizations tested)

The hardware testing setup:

VPS server -> Tailscale tunnel -> Windows PC w. RTX 5090 -> LM Studio (server)

Looking into adding a Blackwell RTX 6000 to cover more types of quantized models.

Even though I am considering adding a Blackwell RTX 6000, the main idea is to focus on testing quantized models that can run on consumer GPU cards - so models up to around 32 GB of VRAM consumption are the main target. The idea behind adding this specific card is the close speed alignment between the RTX 5090 and the RTX 6000. This keeps the ongoing capture of tokens-per-second speed somewhat comparable, whereas adding other types of setups might skew the real-world tokens-per-second numbers and make them less valuable as a data point. LM Studio is not the fastest, but it is a baseline that anyone diving into AI can start with - without needing to know much themselves.
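For a sense of how a harness can talk to this setup: LM Studio exposes an OpenAI-compatible HTTP API, so a single test prompt can be sent over the Tailscale address roughly like this (simplified sketch - the address, port, and parameters here are illustrative placeholders, not the exact configuration):

```python
import requests

# Placeholder Tailscale address of the Windows PC running the LM Studio server
LMSTUDIO_URL = "http://100.64.0.10:1234/v1/chat/completions"

def run_test_prompt(model: str, prompt: str, max_tokens: int = 4096) -> dict:
    """Send one test case to the LM Studio server and return the raw JSON response."""
    payload = {
        "model": model,  # the specific quantization currently loaded in LM Studio
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # per-answer output budget (4096 in these tests)
        "temperature": 0.0,
    }
    resp = requests.post(LMSTUDIO_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()
```

The JSON response from an OpenAI-compatible endpoint also typically carries prompt/completion token counts, which is where per-test token data can be read off.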

The benchmark is built around 6 test suites:

- 64 tests with "Tool-Calls"

- 64 tests with "Instruction Following"

- 64 tests with "Structured Output"

- 64 tests with "Code Correctness"

- 64 tests with "Logic & Reasoning"

- 64 tests with "Vision Reasoning"

So all in all, each and every quantization is tested against 384 test cases (6 suites × 64 tests).

The tests are practical and are meant to show where/how quantized models break - specifically in practical work, where you mix work disciplines.

Tests are built to accept only the specific correct answer, in a specific answer format (a simplified sketch of this check follows the example outputs below).

E.g., raw test outputs from a single reasoning test:

// "<answer>no</answer>" :: Correct answer in correct format == correct

// "<answer>120</answer>" :: Wrong answer in correct format == wrong

// "Based on the visual evidence, no, the blister package has not been opened. The packaging shows multiple identical units of Paracetamol (Poro) tablets arranged vertically in a single row. There is no indication that the package was opened or that any tablet inside has been removed." :: Verbal explanation == wrong

// "No" :: Correct answer in wrong format == wrong

When the models are prompted with a question, they are nudged with the constraint that they only have 4096 output tokens available for their response, per test answer. So far the actual outputs show that the average correct answer consumes less than 10% of this "constraint".

To be able to deliver high-quality data for ongoing analysis, I capture all the data points I found meaningful to include, e.g. (a sketch of what one captured record looks like follows the list):

- Raw response output

- Tokens Input

- Tokens Output

- Latency in ms

- Token output speed

- Pass (Score - 4 test suites allow partially correct answers)
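Roughly, one captured record per test case, per quantization, ends up looking like this (field names here are illustrative, not the exact schema):

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    """One captured row per test case, per quantization (illustrative field names)."""
    quant_id: str          # e.g. "qwen3.5-35b-a3b @ Q4_K_M"
    suite: str             # e.g. "Vision Reasoning"
    raw_output: str        # full raw response text
    tokens_in: int         # prompt tokens
    tokens_out: int        # completion tokens
    latency_ms: float      # end-to-end latency
    tokens_per_sec: float  # output speed on the test rig
    score: float           # pass score; 4 of the 6 suites allow partial credit
```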

A website is available. It works fairly well on desktop (semi-well on mobile).

The website has a 64-pixel grid "heatmap" view for individual test case output inspection.

https://preview.redd.it/hrxot71dt4zg1.png?width=2153&format=png&auto=webp&s=966efc4ad4179ba915c1c16b677ff25daf5bd38b

The website has a history overview to see the latest test runs - updated live as tests run:

https://preview.redd.it/a9z6u2f7u4zg1.png?width=2153&format=png&auto=webp&s=a14b4c110ecb8149b25fa817d36cc02f14ea4626

I am working on a report builder - for anyone to make custom reports on the data:

https://preview.redd.it/0r3tbpwiu4zg1.png?width=2151&format=png&auto=webp&s=81b9465a00d47cba8800480aff39a1f1bf435627

Hope you find the project and its intent useful. The idea is to help out everyone who has an interest in taking a more data-driven path when selecting an LLM quantization for their AI endeavours 😎

P.S. There is a ton of information to share about the project and test results. If you have a specific interest, please note it and I will try to go deeper into those specific areas in the next posts. There are no sponsors and no monetization. It's driven by an interest in AI.

submitted by /u/norms_are_practical