I just uploaded a new GGUF release here:
https://huggingface.co/slyfox1186/qwen35-9b-opus46-mix-i1-GGUF
This is my own Qwen 3.5 9B finetune/export project. The base model is unsloth/Qwen3.5-9B, and this run was trained primarily on nohurry/Opus-4.6-Reasoning-3000x-filtered, with extra mixed data from Salesforce/xlam-function-calling-60k and OpenAssistant/oasst2.
The idea here was pretty simple: keep a small local model, push it harder toward stronger reasoning traces and more structured assistant behavior, then export clean GGUF quants for local use.
The repo currently has these GGUFs:
- Q4_K_M
- Q8_0
In the name:
- opus46 = primary training source was the Opus 4.6 reasoning-distilled dataset
- mix = I also blended in extra datasets beyond the primary source
- i1 = imatrix was used during quantization
I also ran a first speed-only llama-bench pass on my local RTX 4090 box. These are not quality evals, just throughput numbers from the released GGUFs:
- Q4_K_M: about 9838 tok/s prompt processing at 512 tokens, 9749 tok/s at 1024, and about 137.6 tok/s generation at 128 output tokens
- Q8_0: about 9975 tok/s prompt processing at 512 tokens, 9955 tok/s at 1024, and about 92.4 tok/s generation at 128 output tokens
Hardware / runtime for those numbers:
- RTX 4090
- Ryzen 9 7900X
- llama.cpp build commit 6729d49
- -ngl 99
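For anyone who wants to reproduce the speed pass, a llama-bench invocation along these lines should regenerate comparable numbers (the model filename is a placeholder; the flags mirror the run described above):

```shell
# Speed-only throughput pass on a released quant (not a quality eval).
# -p: prompt-processing batch sizes to test, -n: generated tokens,
# -ngl 99: offload all layers to the GPU.
./llama-bench \
  -m qwen35-9b-opus46-mix.Q4_K_M.gguf \
  -p 512,1024 \
  -n 128 \
  -ngl 99
```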
I now also have a first real quality benchmark on the released Q4_K_M GGUF:
- task: gsm8k
- eval stack: lm-eval-harness -> local-completions -> llama-server
- tokenizer reference: Qwen/Qwen3-8B
- server context: 8192
- concurrency: 4
- result: flexible-extract exact_match = 0.8415, strict-match exact_match = 0.8400
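That eval stack roughly amounts to serving the GGUF with llama-server and pointing lm-eval-harness's local-completions backend at it. A sketch of that setup (model path and port are placeholders, and the exact `--model_args` keys can vary between harness versions):

```shell
# Serve the Q4_K_M GGUF with an OpenAI-compatible completions endpoint,
# matching the 8192 server context used for the run.
./llama-server -m qwen35-9b-opus46-mix.Q4_K_M.gguf -c 8192 -ngl 99 --port 8080

# Run GSM8K through lm-eval-harness against that endpoint,
# borrowing the Qwen/Qwen3-8B tokenizer and 4-way concurrency.
lm_eval --model local-completions \
  --model_args base_url=http://127.0.0.1:8080/v1/completions,num_concurrent=4,tokenizer=Qwen/Qwen3-8B \
  --tasks gsm8k
```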
This was built as a real train/export pipeline, not just a one-off conversion. I trained the LoRA, merged it into the base model, generated GGUFs with llama.cpp, and kept the naming tied to the actual training/export configuration so future runs are easier to track.
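The export half of that pipeline can be sketched with llama.cpp's stock tools (paths are placeholders; this assumes the LoRA has already been merged into the HF checkpoint):

```shell
# 1. Convert the merged HF checkpoint to a high-precision GGUF.
python convert_hf_to_gguf.py ./merged-qwen35-9b --outfile model-f16.gguf --outtype f16

# 2. Build an importance matrix from calibration text (the "i1" in the repo name).
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat

# 3. Quantize with the imatrix to produce the released quants.
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q8_0.gguf Q8_0
```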
I still do not have a broader multi-task quality table yet, so I do not want to oversell it. This is mainly a release / build-log post for people who want to try it and tell me where it feels better or worse than stock Qwen3.5-9B GGUFs.
If anyone tests it, I would especially care about feedback on:
- reasoning quality
- structured outputs / function-calling style
- instruction following
- whether Q4_K_M feels like the right tradeoff vs Q8_0
If people want, I can add a broader multi-task eval section next, since right now I only have the first GSM8K quality pass plus the llama-bench speed numbers.