llama.cpp DeepSeek v4 Flash experimental inference

Reddit r/LocalLLaMA / 4/26/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post shares an experimental integration of DeepSeek v4 with llama.cpp, along with a GGUF checkpoint intended to run inference using about 128GB of RAM.
  • The author reports that the model performs well even when routed experts are heavily quantized to 2 bits, while the remaining shared components are kept at Q8 to manage quality and size tradeoffs.
  • In limited testing on a MacBook M3 Max, the author observed promising speed (initially ~17 tokens/sec, later improved to ~21 tokens/sec after Metal-related optimizations).
  • The writer notes uncertainty about ultimate model quality but suggests it may outperform Qwen 3.6 27B in conversational reply quality, pending more benchmarks.
  • Follow-up edits indicate the author fixed a CMake error that came from using a non-standard GGUF generation tool and also resolved a long-context bug.

Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here is the GGUF you can use to run inference with "just" (lol) 128GB of RAM. The model, even quantized at 2 bits, looks very solid in my limited testing, and the speed of 17 t/s on my MacBook M3 Max is quite interesting; I would say we are in the usable zone.
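For anyone who wants to try it, running a GGUF like this with a llama.cpp build usually goes through the stock llama-cli binary (assuming the experimental branch builds the standard tools); the model filename, context size, and generation length below are placeholders rather than values from the post:

```sh
# Hypothetical invocation: the model path and sizes are placeholders, not from the post.
# -ngl 99 offloads all layers to the GPU (Metal on Apple Silicon),
# -c sets the context window, -n the number of tokens to generate.
./build/bin/llama-cli \
  -m ./deepseek-v4-routed2bit-shared-q8.gguf \
  -ngl 99 \
  -c 8192 \
  -n 256 \
  -p "Explain mixture-of-experts routing in two sentences."
```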

What I did was to heavily quantize the routed experts to 2 bits, using two different 2-bit quants to balance error and size. All the rest of the model, including the shared expert of each layer, is Q8: it is not worth playing with the most sensitive parts of the model when the bulk of the weights is in the routed experts.
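The author produced this mix with a custom tool (see the first edit below), so the following is only a rough sketch of a similar split using the stock llama-quantize per-tensor overrides; the tensor-name patterns, the choice of IQ2_XS/Q2_K as the two 2-bit types, and the file names are all assumptions, not the author's recipe:

```sh
# NOT the author's tool: an approximation with the stock quantizer, assuming the
# --tensor-type override flag and DeepSeek-style routed-expert tensor names (ffn_*_exps).
# The Q8_0 base type keeps attention, embeddings and shared experts at 8 bits, while
# the routed-expert tensors are pushed down to two different 2-bit types.
./build/bin/llama-quantize \
  --tensor-type "ffn_up_exps=iq2_xs" \
  --tensor-type "ffn_gate_exps=iq2_xs" \
  --tensor-type "ffn_down_exps=q2_k" \
  ./deepseek-v4-f16.gguf \
  ./deepseek-v4-routed2bit-shared-q8.gguf \
  Q8_0
```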

I have the feeling that, even quantized at 2 bits, this will prove to be a stronger model than Qwen 3.6 27B, but this is only a feeling based on the quality of the replies I get chatting with it. There is more to experiment with, and benchmarks to run.
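Until proper benchmarks are run, one cheap quantitative signal is perplexity over a held-out text with the stock llama-perplexity tool, plus llama-bench for throughput; the file paths below are placeholders:

```sh
# Quick quality signal: perplexity on a held-out text file (placeholder path).
./build/bin/llama-perplexity -m ./deepseek-v4-routed2bit-shared-q8.gguf -f ./wiki.test.raw -ngl 99

# Prompt-processing and generation throughput on the same machine.
./build/bin/llama-bench -m ./deepseek-v4-routed2bit-shared-q8.gguf
```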

EDIT: sorry for the CMake error, I produced the GGUF using a tool that I decided not to ship (not ready for prime time..., mostly a hack) instead of using the standard quantizer of llama.cpp. Now the problem is fixed. Also, inference on Metal is now at 21 tokens/sec after some optimization.
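For reference, a standard CMake build of a llama.cpp checkout on Apple Silicon looks like the sketch below; on recent versions the Metal backend is enabled by default on macOS, so the explicit flag is shown only to make the intent visible and may be redundant on the experimental branch:

```sh
# Out-of-tree CMake build; GGML_METAL=ON is the Metal backend switch in recent llama.cpp
# and is typically already on by default on macOS.
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
```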

EDIT2: also fixed the long-context bug.

submitted by /u/antirez