Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here is the GGUF you can use to run inference with "just" (lol) 128GB of RAM. The model, even quantized at 2 bits, looks very solid in my limited testing, and the speed of 17 t/s on my MacBook M3 Max is quite interesting; I would say we are in the usable zone.
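If you prefer to poke at it from Python instead of the llama.cpp CLI, something along these lines should work; note this is just a sketch under my assumptions: the llama-cpp-python bindings are not part of the linked branch (you'd need a build that includes the experimental support), and the model filename is a placeholder.

```python
# Minimal sketch (my assumption, not part of the linked branch): running the
# 2-bit GGUF from Python via the llama-cpp-python bindings instead of the
# llama.cpp CLI. The model filename below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-2bit.gguf",  # placeholder path to the quantized GGUF
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to Metal on Apple Silicon
)

out = llm("Explain what a routed expert is in a MoE model.", max_tokens=256)
print(out["choices"][0]["text"])
```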
What I did was heavily quantize the routed experts to 2 bits, using two different 2-bit quant types to balance error and size. All the rest of the model, including the shared expert of each layer, is Q8: it is not worth touching the most sensitive parts of the model when the bulk of the weights are in the routed experts.
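To make the idea concrete, here is a rough sketch of the kind of per-tensor policy I mean. This is NOT the tool I actually used (that one is a hack I'm not shipping): the tensor name patterns are the usual llama.cpp GGUF ones for MoE models, while the two specific 2-bit types and the per-layer alternation are purely illustrative assumptions.

```python
# Rough sketch of the per-tensor quantization policy described above, NOT the
# actual tool I used. Tensor name patterns follow the usual llama.cpp GGUF
# naming for MoE models; the two specific 2-bit types and the per-layer
# alternation are illustrative assumptions on my part.
def pick_quant_type(tensor_name: str, layer_index: int) -> str:
    # Routed expert tensors look like "blk.N.ffn_{gate,down,up}_exps.weight";
    # shared experts use "_shexp." and therefore do not match this test.
    is_routed_expert = ".ffn_" in tensor_name and "_exps." in tensor_name
    if is_routed_expert:
        # Alternate two different 2-bit quants to balance error against size.
        return "IQ2_XXS" if layer_index % 2 == 0 else "Q2_K"
    # Everything else (attention, shared experts, router, embeddings) stays Q8.
    return "Q8_0"

if __name__ == "__main__":
    print(pick_quant_type("blk.10.ffn_down_exps.weight", 10))   # routed expert -> 2-bit
    print(pick_quant_type("blk.10.ffn_down_shexp.weight", 10))  # shared expert -> Q8_0
    print(pick_quant_type("blk.10.attn_q.weight", 10))          # attention -> Q8_0
```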
I have the feeling that, even quantized at 2 bits, this will prove to be a stronger model than Qwen 3.6 27B, but this is only a feeling based on the quality of the replies I get chatting with it. There is more to experiment with, and benchmarks to run.
EDIT: sorry for the CMake error; I produced the GGUF using a tool that I decided not to ship (not ready for prime time..., mostly a hack) instead of the standard llama.cpp quantizer. The problem is now fixed. Also, inference on Metal is now 21 tokens/sec after some optimization.
EDIT2: also fixed the long-context bug.




