Qwen 3.5 9B LLM GGUF quantized for local structured extraction

Reddit r/LocalLLaMA / 4/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The article describes a Q4_K_M GGUF quantization of the acervo-extractor-qwen3.5-9b model, tailored for structured extraction from invoices, contracts, and financial reports.
  • Compared with the float16 baseline, the quantized model reduces disk usage to 4.7 GB (about 26% of the original 18 GB), lowers peak RAM to 5.7 GB, and runs slightly faster at 47.8 tok/s vs 42.7 tok/s.
  • Latency improves as well, with mean latency dropping to 20.9 ms from 23.4 ms and P95 to 26.9 ms from 30.2 ms, while perplexity changes moderately (19.54 vs 18.43).
  • It provides example inference code using llama.cpp to run local extraction tasks (including an air-gapped use-case for sensitive financial/legal documents).
  • The repo includes a full quantization pipeline and benchmark scripts, and an additional Q8_0 variant is referenced, with a Hugging Face model link provided for download.

The gap between "this fine-tune does exactly what I need" and "this fine-tune actually runs on my hardware" is where most specialized models for structured extraction use cases die.

To fix this, we quantized acervo-extractor-qwen3.5-9b to Q4_K_M. It's a 9B Qwen 3.5 model fine-tuned for structured data extraction from invoices, contracts, and financial reports.

Benchmark vs float16:

- Disk: 4.7 GB vs 18 GB (26% of original)

- RAM: 5.7 GB vs 20 GB peak

- Speed: 47.8 tok/s vs 42.7 tok/s (1.12x)

- Mean latency: 20.9 ms vs 23.4 ms | P95: 26.9 ms vs 30.2 ms

- Perplexity: 19.54 vs 18.43 (+6%)
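The reported deltas check out arithmetically; a quick sanity check of the ratios behind the headline numbers:

```python
disk_ratio = 4.7 / 18.0               # ~0.26 -> "26% of original" disk size
speedup = 47.8 / 42.7                 # ~1.12x throughput over float16
ppl_delta = (19.54 - 18.43) / 18.43   # ~0.06 -> "+6%" perplexity increase

print(f"{disk_ratio:.2f}, {speedup:.2f}x, +{ppl_delta * 100:.0f}%")
```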

Usage with llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(model_path="acervo-extractor-qwen3.5-9b-Q4_K_M.gguf", n_ctx=2048)
output = llm("Extract key financial metrics from: [doc]", max_tokens=256, temperature=0.1)
```
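Extraction fine-tunes like this typically emit JSON, so downstream code needs to pull the object out of the raw completion text. A minimal sketch (the helper name and the assumption that the model returns a single JSON object are mine; the post doesn't specify the output schema):

```python
import json
import re

def parse_extraction(text: str) -> dict:
    """Pull the first JSON object out of a model completion.

    Hypothetical helper: assumes the fine-tune emits one JSON object,
    possibly surrounded by prose.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# e.g. with the completion from the llm(...) call above:
# metrics = parse_extraction(output["choices"][0]["text"])
```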

What this actually unlocks:

A task-specific extraction model running air-gapped. For pipelines handling sensitive financial or legal documents, local inference isn't a preference, it's a requirement.

Q8_0 also in the repo: 10.7 GB RAM, 22.1 ms mean latency, perplexity 18.62 (+1%).

Model on Hugging Face:

https://huggingface.co/daksh-neo/acervo-extractor-qwen3.5-9b-GGUF

FYI: Full quantization pipeline and benchmark scripts included. Adapt them for any model in the same family.
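For reference, the standard llama.cpp route from an HF checkpoint to a Q4_K_M GGUF is a two-step convert-then-quantize pipeline (`convert_hf_to_gguf.py` and `llama-quantize` ship with llama.cpp; the repo's own scripts may wrap these differently). A sketch that just builds the command lines:

```python
from pathlib import Path

def quantize_commands(hf_dir: str, out_dir: str, quant: str = "Q4_K_M") -> list:
    """Build the two-step pipeline: HF checkpoint -> f16 GGUF -> quantized GGUF.

    Paths and tool names are assumptions based on stock llama.cpp tooling.
    """
    name = Path(hf_dir).name
    f16 = f"{out_dir}/{name}-f16.gguf"
    quantized = f"{out_dir}/{name}-{quant}.gguf"
    return [
        ["python", "convert_hf_to_gguf.py", hf_dir, "--outfile", f16, "--outtype", "f16"],
        ["llama-quantize", f16, quantized, quant],
    ]

# Each command would then be run with subprocess.run(cmd, check=True).
```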

submitted by /u/gvij