Qwen3.5-9B finetune/export with Opus 4.6 reasoning distillation + mixed extras

Reddit r/LocalLLaMA / 3/23/2026

Key Points

  • A new GGUF release of a Qwen3.5-9B finetune has been published, based on unsloth/Qwen3.5-9B and trained primarily on Opus-4.6-Reasoning-3000x-filtered, with extra mixed data from xlam-function-calling-60k and OASST2.
  • The GGUF variants provided are Q4_K_M and Q8_0, with naming explained: opus46 indicates Opus 4.6 reasoning-distilled data, mix indicates additional datasets, and i1 indicates imatrix during quantization.
  • A first speed-focused benchmark on an RTX 4090 shows throughput numbers: Q4_K_M around 9838 tok/s (512 tokens) and 9749 tok/s (1024 tokens) for prompt processing, and 137.6 tok/s generation at 128 output tokens; Q8_0 around 9975 tok/s (512), 9955 tok/s (1024), and 92.4 tok/s generation at 128.
  • A quality benchmark on Q4_K_M using gsm8k reports flexible-extract exact_match 0.8415 and strict-match exact_match 0.84; the work is presented as a real train/export pipeline (LoRA training, merging, and GGUF generation) for local use, with a note that this is not a full multi-task quality table yet.

I just uploaded a new GGUF release here:

https://huggingface.co/slyfox1186/qwen35-9b-opus46-mix-i1-GGUF

This is my own Qwen 3.5 9B finetune/export project. The base model is unsloth/Qwen3.5-9B, and this run was trained primarily on nohurry/Opus-4.6-Reasoning-3000x-filtered, with extra mixed data from Salesforce/xlam-function-calling-60k and OpenAssistant/oasst2.

The idea here was pretty simple: keep a small local model, push it harder toward stronger reasoning traces and more structured assistant behavior, then export clean GGUF quants for local use.

The repo currently has these GGUFs:

  • Q4_K_M
  • Q8_0
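For anyone who wants to try a quant directly, recent llama.cpp builds can pull a GGUF straight from a Hugging Face repo. This is a hedged sketch, not from the release notes; the `:Q4_K_M` tag selects the quant, and the context size and GPU-offload values here are just reasonable defaults.

```shell
# Pull the Q4_K_M quant from the HF repo and start an interactive session.
# Requires a recent llama.cpp build with -hf support.
llama-cli -hf slyfox1186/qwen35-9b-opus46-mix-i1-GGUF:Q4_K_M -ngl 99 -c 8192
```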

In the name:

  • opus46 = primary training source was the Opus 4.6 reasoning-distilled dataset
  • mix = I also blended in extra datasets beyond the primary source
  • i1 = imatrix was used during quantization
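For context on the i1 tag, imatrix quants are typically produced in two llama.cpp steps: compute an importance matrix from calibration text, then pass it to the quantizer. This is a hedged sketch of that generic flow, not the exact commands used for this release; `calib.txt` and the model filenames are placeholders.

```shell
# Step 1: compute an importance matrix from a calibration text file.
llama-imatrix -m model-f16.gguf -f calib.txt -o imatrix.dat -ngl 99

# Step 2: quantize to Q4_K_M, weighting by the importance matrix.
llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```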

I also ran a first speed-only llama-bench pass on my local RTX 4090 box. These are not quality evals, just throughput numbers from the released GGUFs:

  • Q4_K_M: about 9838 tok/s prompt processing at 512 tokens, 9749 tok/s at 1024, and about 137.6 tok/s generation at 128 output tokens
  • Q8_0: about 9975 tok/s prompt processing at 512 tokens, 9955 tok/s at 1024, and about 92.4 tok/s generation at 128 output tokens

Hardware / runtime for those numbers:

  • RTX 4090
  • Ryzen 9 7900X
  • llama.cpp build commit 6729d49
  • -ngl 99
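A llama-bench invocation matching those settings would look roughly like this; the model path is a placeholder, and the prompt/generation sizes mirror the reported 512/1024-token prompt-processing and 128-token generation runs.

```shell
# Speed-only benchmark: prompt processing at 512 and 1024 tokens,
# generation at 128 tokens, all layers offloaded to the GPU.
llama-bench -m qwen35-9b-opus46-mix-Q4_K_M.gguf -p 512,1024 -n 128 -ngl 99
```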

I now also have a first real quality benchmark on the released Q4_K_M GGUF:

  • task: gsm8k
  • eval stack: lm-eval-harness -> local-completions -> llama-server
  • tokenizer reference: Qwen/Qwen3-8B
  • server context: 8192
  • concurrency: 4
  • result:
    • flexible-extract exact_match = 0.8415
    • strict-match exact_match = 0.8400
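The eval stack above can be sketched as two commands: llama-server exposes an OpenAI-style completions endpoint, and lm-eval-harness drives it through its local-completions backend. This is a hedged reconstruction from the listed settings (context 8192, concurrency 4, Qwen/Qwen3-8B as the tokenizer reference); the model filename and port are assumptions.

```shell
# Serve the released Q4_K_M GGUF with 8192 context and 4 parallel slots.
llama-server -m model-Q4_K_M.gguf -c 8192 -ngl 99 --parallel 4 &

# Point lm-eval-harness at the local completions endpoint and run gsm8k.
lm_eval --model local-completions \
  --model_args model=qwen35-9b,base_url=http://127.0.0.1:8080/v1/completions,num_concurrent=4,tokenizer=Qwen/Qwen3-8B \
  --tasks gsm8k
```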

This was built as a real train/export pipeline, not just a one-off convert. I trained the LoRA, merged it, generated GGUFs with llama.cpp, and kept the naming tied to the actual training/export configuration so future runs are easier to track.

I do not have a broader multi-task quality table yet, so I do not want to oversell it. This is mainly a release / build-log post for people who want to try it and tell me where it feels better or worse than stock Qwen3.5-9B GGUFs.

If anyone tests it, I would especially care about feedback on:

  • reasoning quality
  • structured outputs / function-calling style
  • instruction following
  • whether Q4_K_M feels like the right tradeoff vs Q8_0

If people want, I can add a broader multi-task eval section next, since right now I only have the first GSM8K quality pass plus the llama-bench speed numbers.

submitted by /u/RiverRatt