The gap between "this fine-tune does exactly what I need" and "this fine-tune actually runs on my hardware" is where most specialized models die, and structured extraction use cases are no exception.
To fix this, we quantized acervo-extractor-qwen3.5-9b to Q4_K_M. It's a 9B Qwen 3.5 model fine-tuned for structured data extraction from invoices, contracts, and financial reports.
Benchmark vs float16:
- Disk: 4.7 GB vs 18 GB (26% of original)
- RAM: 5.7 GB vs 20 GB peak
- Speed: 47.8 tok/s vs 42.7 tok/s (1.12x)
- Mean latency: 20.9 ms vs 23.4 ms | P95: 26.9 ms vs 30.2 ms
- Perplexity: 19.54 vs 18.43 (+6%)
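For context, per-token mean/P95 latency numbers like the ones above can be reproduced with a small timing harness. This is my own sketch (not the repo's benchmark scripts); `generate_token` stands in for whatever callable produces one token:

```python
import statistics
import time

def latency_stats(generate_token, n_tokens=200):
    """Time n_tokens calls to generate_token() and report per-token latency."""
    samples_ms = []
    for _ in range(n_tokens):
        t0 = time.perf_counter()
        generate_token()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    samples_ms.sort()
    # Nearest-rank P95 over the sorted samples
    p95 = samples_ms[int(0.95 * (len(samples_ms) - 1))]
    mean = statistics.mean(samples_ms)
    return {"mean_ms": mean, "p95_ms": p95, "tok_per_s": 1000.0 / mean}
```

Run it against the Q4_K_M and float16 models with the same prompt set and the speedup falls out directly.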
Usage with llama-cpp-python:

from llama_cpp import Llama

llm = Llama(model_path="acervo-extractor-qwen3.5-9b-Q4_K_M.gguf", n_ctx=2048)
output = llm("Extract key financial metrics from: [doc]", max_tokens=256, temperature=0.1)

What this actually unlocks:
A task-specific extraction model that runs fully air-gapped. For pipelines handling sensitive financial or legal documents, local inference isn't a preference; it's a requirement.
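For pipelines like that, it's worth schema-checking the model's output before it reaches downstream systems, even at temperature 0.1. A minimal sketch, assuming the model is prompted to emit JSON (the field names here are hypothetical, not from the model card):

```python
import json

# Hypothetical schema for an invoice-extraction prompt
REQUIRED_FIELDS = {"invoice_number", "total", "currency"}

def parse_extraction(raw: str) -> dict:
    """Validate model output before it enters downstream systems."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data
```

Malformed or incomplete generations fail loudly at the boundary instead of silently corrupting records.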
Q8_0 also in the repo: 10.7 GB RAM, 22.1 ms mean latency, perplexity 18.62 (+1%).
Model on Hugging Face:
https://huggingface.co/daksh-neo/acervo-extractor-qwen3.5-9b-GGUF
FYI: The full quantization pipeline and benchmark scripts are included. Adapt them for any model in the same family.
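If you'd rather not read the scripts first: the pipeline presumably follows the standard llama.cpp flow, which looks roughly like this (paths and output names here are illustrative, not the repo's):

```shell
# Convert the HF checkpoint to a float16 GGUF (script ships with llama.cpp)
python convert_hf_to_gguf.py ./acervo-extractor-qwen3.5-9b \
    --outtype f16 --outfile model-f16.gguf

# Quantize to Q4_K_M (or Q8_0 for the higher-fidelity variant)
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```

The same two steps work for any model llama.cpp's converter supports, which is what makes the pipeline reusable across the family.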




