Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB

Reddit r/LocalLLaMA / 5/1/2026


Key Points

  • A developer reported getting DFlash speculative decoding working in llama.cpp with a VRAM-limited RTX 2080 SUPER 8GB using the Qwen3.5-35B-A3B model.
  • By combining MoE expert CPU offload (via -ncmoe tuning) with a small DFlash draft model, they overcame the fact that the 35B MoE model alone does not fit in 8GB VRAM.
  • Their baseline (non-DFlash) performance was about 26.8 tokens/s, while DFlash improved it to roughly 35.6–35.8 tokens/s, yielding around a 33–34% generation speedup.
  • They found a key tuning change: the optimal -ncmoe value shifted from 32 (baseline) to 34 when using DFlash.
  • Draft length parameter (--draft-max) required experimentation; the best results clustered around 5–7 with very high acceptance rates, while larger values reduced speed and acceptance.

## Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB

I managed to get **DFlash speculative decoding** working in llama.cpp on a pretty VRAM-limited setup. This was tested with the DFlash PR: https://github.com/ggml-org/llama.cpp/pull/22105

Build tested:

```text
67cb0d507 (8942)
```

Setup:

```text
GPU: RTX 2080 SUPER 8GB
Model: Qwen3.5-35B-A3B Q5_K_M
Draft model: Qwen3.5-35B-A3B-DFlash Q4_K_M
Backend: CUDA
```

The main model is a 35B MoE GGUF around 24.44 GiB, so obviously it does not fit in 8GB VRAM. The trick was combining MoE expert CPU offload with DFlash.
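
For intuition, here is a rough back-of-envelope of how expert offload changes the VRAM picture. Only the 24.44 GiB file size and the 8GB VRAM figure come from this setup; the expert-weight share and offload fraction below are hypothetical placeholders, since the post does not report them.

```python
# Back-of-envelope VRAM budget for a MoE GGUF with expert CPU offload.
# In llama.cpp, -ncmoe keeps the MoE expert tensors of the first N layers in
# system RAM while the rest of the model stays on the GPU.
model_file_gib = 24.44     # Q5_K_M GGUF size (from the post)
vram_gib = 8.0             # RTX 2080 SUPER (from the post)

expert_share = 0.85        # hypothetical: fraction of weights living in MoE expert tensors
offload_share = 0.90       # hypothetical: fraction of expert tensors kept on the CPU via -ncmoe

experts_gib = model_file_gib * expert_share
resident_gib = (model_file_gib - experts_gib) + experts_gib * (1 - offload_share)

print(f"weights resident in VRAM: ~{resident_gib:.1f} GiB")
print(f"headroom for KV cache, draft model, CUDA buffers: ~{vram_gib - resident_gib:.1f} GiB")
```

With numbers in that ballpark, the dense part plus a slice of the experts fits, and whatever headroom remains is shared between the KV cache and the ~268 MiB DFlash draft, which is why the best -ncmoe value shifts once the draft is added.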

Baseline

My best normal non-DFlash run was around:

~26.8 tok/s 

with roughly:

-ngl 999 -ncmoe 32 -fa 1 -ctk q8_0 -ctv q8_0 --no-mmap -t 5 

-ncmoe 32 was the best baseline point. Lower values used too much VRAM / performed worse, and higher values slowly reduced speed.
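
To find that point without lots of manual runs, a small sweep script helps. The sketch below is not from the post: the binary name, flag set, and the regex for the perf line are assumptions to adapt to your own llama.cpp build and its output.

```python
# Hypothetical sweep over -ncmoe values: run the same prompt at each setting
# and pull the reported generation speed out of the log. Adjust the binary,
# flags, and regex to match your llama.cpp build's actual output.
import re
import subprocess

BINARY = r"build\bin\Release\llama-cli.exe"   # assumption: plain llama-cli for the baseline
MODEL = r"C:\models\Qwen3.5-35B-A3B-Q5_K_M.gguf"
SPEED_RE = re.compile(r"([\d.]+)\s*tokens per second")  # assumption about the perf line wording

for ncmoe in range(28, 40, 2):
    cmd = [
        BINARY, "-m", MODEL,
        "-ngl", "999", "-ncmoe", str(ncmoe),
        "-fa", "1", "-ctk", "q8_0", "-ctv", "q8_0",
        "--no-mmap", "-t", "5",
        "-n", "256", "--temp", "0", "--seed", "42",
        "-p", "Write a complete Python implementation of quicksort.",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True)
    speeds = SPEED_RE.findall(out.stdout + out.stderr)
    # llama.cpp prints prompt and generation speeds; the last match is usually generation
    speed = speeds[-1] if speeds else "n/a"
    print(f"-ncmoe {ncmoe}: {speed} tok/s")
```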

DFlash setup

For DFlash, I used:

```text
Target model: C:\models\Qwen3.5-35B-A3B-Q5_K_M.gguf
Draft model:  C:\models\Qwen3.5-35B-A3B-DFlash-Q4_K_M.gguf
```

The draft model is tiny compared to the target:

```text
DFlash draft size: ~267.8 MiB
Draft params: ~474M
Draft quant: Q4_K_M
```

Because the DFlash draft also needs VRAM, the best -ncmoe setting changed slightly. For the normal run, -ncmoe 32 was best. With DFlash, the sweet spot became:

-ncmoe 34 

Final command

This is the command I ended up using for testing:

```text
build\bin\Release\llama-speculative-simple.exe ^
  -m "C:\models\Qwen3.5-35B-A3B-Q5_K_M.gguf" ^
  -md "C:\models\Qwen3.5-35B-A3B-DFlash-Q4_K_M.gguf" ^
  --dflash ^
  -p "Write a complete Python implementation of quicksort, mergesort, heapsort, and binary search. Include concise comments. Write code only." ^
  -n 512 ^
  --draft-max 6 ^
  -cd 512 -c 4096 ^
  --temp 0 --top-k 1 --seed 42 ^
  -ngl 999 -ngld 99 -ncmoe 34 ^
  -fa on ^
  -ctk q8_0 -ctv q8_0 ^
  -ctkd q8_0 -ctvd q8_0 ^
  --no-mmap ^
  -t 5
```

Results

Typical DFlash result:

```text
encoded  39 tokens in ~1.0 sec
decoded 514 tokens in ~14.3-14.5 sec
speed: ~35.6-35.8 tok/s
n_draft   = 6
n_predict = 514
n_drafted = 430
n_accept  = 427
accept    = 99.302%
```

Compared to the baseline:

```text
Normal: ~26.8 tok/s
DFlash: ~35.6-35.8 tok/s
Gain:   ~1.33x
```

So this gave me around a 33–34% generation speedup on an 8GB RTX 2080 SUPER.
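
As a sanity check, the counters above already tell you how much work DFlash saved the target model. The arithmetic below assumes each verification pass emits the accepted drafted tokens plus one token sampled by the target, which is the usual speculative-decoding bookkeeping; the gap between tokens-per-pass and the measured ~1.33x is expected, since drafting and batched verification (with experts on the CPU) are not free.

```python
# Rough bookkeeping from the reported counters.
n_predict = 514   # total generated tokens
n_drafted = 430   # tokens proposed by the DFlash draft
n_accept  = 427   # drafted tokens accepted by the target

# Assumption: each verification pass also yields one target-sampled token,
# so the number of target passes is roughly n_predict - n_accept.
n_passes = n_predict - n_accept
print(f"target forward passes: ~{n_passes}")
print(f"tokens per target pass: ~{n_predict / n_passes:.1f} (vs. 1 without speculation)")
print(f"draft acceptance: {n_accept / n_drafted:.1%}")   # matches the reported 99.302%
```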

Draft length tuning

I tested a few --draft-max values:

```text
draft-max 5:  ~34.6 tok/s,      accept ~97.9%
draft-max 6:  ~35.6-36.9 tok/s, accept ~99.3%
draft-max 7:  ~35.7 tok/s,      accept ~95.8%
draft-max 8:  ~34.1 tok/s,      accept ~94.7%
draft-max 12: ~31.5 tok/s,      accept ~83.4%
```

--draft-max 6 was the sweet spot. Higher values were not better because the acceptance rate dropped.

Quantization used

Target model:

```text
Qwen3.5-35B-A3B-Q5_K_M.gguf
file size: 24.44 GiB
type: Q5_K_M
```

Internally the target GGUF reports:

```text
f32:  301 tensors
q8_0: 312 tensors
q5_K:  80 tensors
q6_K:  40 tensors
```

DFlash draft model:

```text
Qwen3.5-35B-A3B-DFlash-Q4_K_M.gguf
file size: 267.80 MiB
type: Q4_K_M
```

Internally the draft GGUF reports:

```text
f32:  34 tensors
q4_K: 49 tensors
q6_K:  8 tensors
```

KV cache:

```text
Target KV: q8_0 / q8_0
Draft KV:  q8_0 / q8_0
```

I also tried lower draft KV quantization, but it did not really help:

```text
draft KV q8_0/q8_0: ~35.8 tok/s
draft KV q4_0/q4_0: ~35.6 tok/s
```

So I kept draft KV at q8_0.
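
That result makes sense if you estimate how small the draft's KV cache is in the first place. The 9 layers and the 512-token draft context come from this run; the head count and head dimension below are hypothetical, since the DFlash draft's architecture isn't listed, and the bytes-per-value figures are the standard GGML q8_0 and q4_0 block sizes.

```python
# Rough size of the draft model's KV cache under q8_0 vs q4_0,
# assuming a standard transformer KV layout.
n_layers = 9                  # from the log: "offloaded 9/9 layers to GPU"
ctx = 512                     # -cd 512
n_kv_heads = 8                # hypothetical
head_dim = 128                # hypothetical

# GGML block sizes: q8_0 stores 32 values in 34 bytes, q4_0 in 18 bytes.
bytes_per_value = {"q8_0": 34 / 32, "q4_0": 18 / 32}

for quant, b in bytes_per_value.items():
    kv_bytes = 2 * n_layers * ctx * n_kv_heads * head_dim * b   # K and V
    print(f"draft KV ({quant}): ~{kv_bytes / 2**20:.1f} MiB")
```

Either way the draft KV is on the order of a few MiB, so the quantization choice frees almost nothing, which lines up with the near-identical speeds above.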

Notes / caveats

The PR/build I tested has some weird timing output in the perf summary, including negative total time and odd unaccounted memory values.

Because of that, I ignored those broken summary fields and focused on the stable parts:

```text
decoded tokens / seconds
accept rate
n_draft / n_accept
```
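
For scripted comparisons it's easy to pull just those fields out of the log and ignore the rest. The patterns below assume the exact wording quoted in this post and may need adjusting for other builds.

```python
# Extract only the stable fields from a llama-speculative-simple log.
# The regexes assume the phrasing quoted in this post ("decoded ... speed: ...",
# "n_accept = ...", "accept = ...%"); adjust them if your build prints differently.
import re

def parse_run(log: str) -> dict:
    stats = {}
    m = re.search(r"decoded\s+(\d+)\s+tokens.*?speed:\s*([\d.]+)", log, re.S)
    if m:
        stats["decoded_tokens"] = int(m.group(1))
        stats["tok_per_s"] = float(m.group(2))
    for key in ("n_draft", "n_predict", "n_drafted", "n_accept"):
        m = re.search(rf"{key}\s*=\s*(\d+)", log)
        if m:
            stats[key] = int(m.group(1))
    m = re.search(r"accept\s*=\s*([\d.]+)%", log)
    if m:
        stats["accept_pct"] = float(m.group(1))
    return stats

# Example with the numbers reported above:
sample = ("decoded 514 tokens in 14.4 sec, speed: 35.7 t/s\n"
          "n_draft = 6\nn_predict = 514\nn_drafted = 430\n"
          "n_accept = 427\naccept = 99.302%")
print(parse_run(sample))
```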

The reported stats also show that DFlash was actually being used:

```text
n_draft   = 6
n_drafted = 430
n_accept  = 427
accept    = 99.302%
```

Also, the draft model was fully loaded on the GPU:

```text
DFlash draft model buffer size = ~267.80 MiB
offloaded 9/9 layers to GPU
```

Bottom line

DFlash actually helped quite a bit here.

On my setup:

```text
RTX 2080 SUPER 8GB
Qwen3.5-35B-A3B Q5_K_M
DFlash draft Q4_K_M
MoE CPU offload
llama.cpp PR #22105
```

I went from about:

26.8 tok/s 

to about:

35.6-35.8 tok/s 

Best current settings:

-ncmoe 34 --draft-max 6 -fa on -ctk q8_0 -ctv q8_0 -ctkd q8_0 -ctvd q8_0 --no-mmap -t 5 

Pretty happy with this result, especially considering the GPU only has 8GB VRAM.

submitted by /u/jwestra