I tried to quantize OLMo-3 7B Instruct into a Bonsai 1-bit format. After looking into different approaches I landed on quantization-aware distillation, which seemed like the most viable path to a usable 1-bit model. The model was trained on 4x B200 GPUs for about 12 hours. Unfortunately I had to stop way too early due to budget constraints. At this point it can produce English and some basic outputs on short sequences, but it is generally not usable. It falls into repetition loops quickly and has almost no context tracking. I believe these issues would have resolved with more training time and a better dataset choice; I picked the wrong one. For the distillation I forked the distilkit library and made some additions. It is easy to use, and the repo includes scripts to export directly to GGUF. I also ran a very short DPO step afterward; there were minor improvements, or maybe not, hard to tell. To run the model you need the Bonsai llama.cpp fork at PrismML-Eng/Bonsai-demo, since the CUDA backend has not been added to upstream llama.cpp yet. For the distillation code see https://github.com/cturan/DistillKit (all written by AI, there may be hallucinated logic and bugs). If you have questions just ask an LLM lol.
Experiment: Olmo 3 7B Instruct Q1_0
Reddit r/LocalLLaMA / 4/13/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- In an attempt to quantize OLMo-3 7B Instruct into a low-bit format (Bonsai's 1-bit format), quantization-aware distillation was chosen as the most viable approach.
- The model was trained on 4x B200 GPUs for about 12 hours but had to be stopped early due to budget constraints; as a result it can produce English output on short sequences but quickly falls into repetition loops, has almost no context tracking, and is not practically usable.
- For the distillation implementation the author forked distilkit, added GGUF export scripts, and ran a short DPO step afterward, with minor (or hard-to-tell) improvements.
- Running the model requires the Bonsai llama.cpp fork (PrismML-Eng/Bonsai-demo), since the CUDA backend has not yet landed in upstream llama.cpp.
- The distillation code is in the DistillKit repository; the author cautions that it was AI-written and may contain hallucinated logic and bugs.
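The quantization-aware distillation described above can be illustrated with a toy sketch. This is a hypothetical illustration, not the actual Bonsai/DistillKit code: weights are binarized to plus/minus a per-tensor scale in the forward pass, gradients flow through a straight-through estimator (STE), and the student is trained to match the teacher's output distribution with a KL loss.

```python
# Toy sketch of 1-bit quantization-aware distillation (hypothetical; the real
# Bonsai/DistillKit pipeline operates on transformer layers, not this toy layer).
import numpy as np

rng = np.random.default_rng(0)

def binarize(w):
    # 1-bit quantization: sign(w) times the mean absolute value (per-tensor scale)
    return np.sign(w) * np.abs(w).mean()

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy single-layer "models": full-precision teacher, binarized student forward.
d_in, d_out = 16, 8
w_teacher = rng.normal(size=(d_in, d_out))
w_student = w_teacher + 0.1 * rng.normal(size=(d_in, d_out))  # init near teacher

x = rng.normal(size=(32, d_in))            # a small batch of inputs
p_teacher = softmax(x @ w_teacher)         # soft targets from the teacher

def kd_kl(w):
    """KL(teacher || student) with the student using 1-bit weights."""
    p_student = softmax(x @ binarize(w))
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))) / len(x)
    return kl, p_student

kl_before, _ = kd_kl(w_student)
lr = 0.1
for _ in range(300):
    _, p_student = kd_kl(w_student)
    g_logits = (p_student - p_teacher) / len(x)  # grad of KL w.r.t. logits
    w_student -= lr * (x.T @ g_logits)           # STE: treat binarize as identity
kl_after, _ = kd_kl(w_student)
print(f"KL before: {kl_before:.4f}  after: {kl_after:.4f}")
```

The latent full-precision weights are what gets updated; only the forward pass sees the 1-bit values, which is the standard trick for training through a non-differentiable quantizer.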
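The short DPO step mentioned above optimizes the standard Direct Preference Optimization objective. A minimal sketch of the per-pair loss follows; the inputs are plain numbers standing in for sequence log-probabilities, and the function is illustrative rather than taken from the author's code.

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are total log-probabilities of the chosen/rejected responses under
    the trained policy (pi_*) and the frozen reference model (ref_*). The loss
    is -log(sigmoid(beta * margin)), where margin is the difference of the
    policy-vs-reference log-ratios for chosen vs rejected.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # logaddexp(0, -z) == -log(sigmoid(z)), numerically stable
    return float(np.logaddexp(0.0, -beta * margin))

# Example: the policy slightly prefers the chosen response relative to the
# reference (margin = 1.0), so the loss sits just below log(2) ~ 0.693.
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-12.0,
                ref_chosen=-10.5, ref_rejected=-11.5)
print(round(loss, 4))  # prints 0.6444
```

With `margin = 0` (policy identical to reference) the loss is exactly log(2); it decreases monotonically as the policy's relative preference for the chosen response grows, which is why even a very short DPO run can nudge outputs without large weight changes.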



