Qwen 3.5 9B LLM GGUF quantized for local structured extraction

Reddit r/LocalLLaMA / 4/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The article describes a Q4_K_M GGUF quantization of the acervo-extractor-qwen3.5-9b model, tailored for structured extraction from invoices, contracts, and financial reports.
  • Compared with the float16 baseline, the quantized model reduces disk usage to 4.7 GB (about 26% of the original 18 GB), lowers peak RAM to 5.7 GB, and runs slightly faster at 47.8 tok/s vs 42.7 tok/s.
  • Latency improves as well, with mean latency dropping to 20.9 ms from 23.4 ms and P95 to 26.9 ms from 30.2 ms, while perplexity changes moderately (19.54 vs 18.43).
  • It provides example inference code using llama.cpp to run local extraction tasks (including an air-gapped use-case for sensitive financial/legal documents).
  • The repo includes a full quantization pipeline and benchmark scripts, and an additional Q8_0 variant is referenced, with a Hugging Face model link provided for download.

The gap between "this fine-tune does exactly what I need" and "this fine-tune actually runs on my hardware" is where most specialized models for structured extraction use cases die.

To fix this, we quantized acervo-extractor-qwen3.5-9b to Q4_K_M. It's a 9B Qwen 3.5 model fine-tuned for structured data extraction from invoices, contracts, and financial reports.

Benchmark vs float16:

- Disk: 4.7 GB vs 18 GB (26% of original)

- RAM: 5.7 GB vs 20 GB peak

- Speed: 47.8 tok/s vs 42.7 tok/s (1.12x)

- Mean latency: 20.9 ms vs 23.4 ms | P95: 26.9 ms vs 30.2 ms

- Perplexity: 19.54 vs 18.43 (+6%)
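The reported deltas check out arithmetically; a quick sanity check of the ratios behind the headline numbers:

```python
disk_ratio = 4.7 / 18.0               # ~0.26 -> "26% of original" disk size
speedup = 47.8 / 42.7                 # ~1.12x throughput over float16
ppl_delta = (19.54 - 18.43) / 18.43   # ~0.06 -> "+6%" perplexity increase

print(f"{disk_ratio:.2f}, {speedup:.2f}x, +{ppl_delta * 100:.0f}%")
```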

Usage with llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(model_path="acervo-extractor-qwen3.5-9b-Q4_K_M.gguf", n_ctx=2048)
output = llm("Extract key financial metrics from: [doc]", max_tokens=256, temperature=0.1)
```
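Extraction fine-tunes like this typically emit JSON, so downstream code needs to pull the object out of the raw completion text. A minimal sketch (the helper name and the assumption that the model returns a single JSON object are mine; the post doesn't specify the output schema):

```python
import json
import re

def parse_extraction(text: str) -> dict:
    """Pull the first JSON object out of a model completion.

    Hypothetical helper: assumes the fine-tune emits one JSON object,
    possibly surrounded by prose.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# e.g. with the completion from the llm(...) call above:
# metrics = parse_extraction(output["choices"][0]["text"])
```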

What this actually unlocks:

A task-specific extraction model running air-gapped. For pipelines handling sensitive financial or legal documents, local inference isn't a preference, it's a requirement.

Q8_0 also in the repo: 10.7 GB RAM, 22.1 ms mean latency, perplexity 18.62 (+1%).

Model on Hugging Face:

https://huggingface.co/daksh-neo/acervo-extractor-qwen3.5-9b-GGUF

FYI: Full quantization pipeline and benchmark scripts included. Adapt them for any model in the same family.
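For reference, the standard llama.cpp route from an HF checkpoint to a Q4_K_M GGUF is a two-step convert-then-quantize pipeline (`convert_hf_to_gguf.py` and `llama-quantize` ship with llama.cpp; the repo's own scripts may wrap these differently). A sketch that just builds the command lines:

```python
from pathlib import Path

def quantize_commands(hf_dir: str, out_dir: str, quant: str = "Q4_K_M") -> list:
    """Build the two-step pipeline: HF checkpoint -> f16 GGUF -> quantized GGUF.

    Paths and tool names are assumptions based on stock llama.cpp tooling.
    """
    name = Path(hf_dir).name
    f16 = f"{out_dir}/{name}-f16.gguf"
    quantized = f"{out_dir}/{name}-{quant}.gguf"
    return [
        ["python", "convert_hf_to_gguf.py", hf_dir, "--outfile", f16, "--outtype", "f16"],
        ["llama-quantize", f16, quantized, quant],
    ]

# Each command would then be run with subprocess.run(cmd, check=True).
```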

submitted by /u/gvij