DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)

Reddit r/LocalLLaMA / 4/14/2026


Key Points

  • DFlash speculative decoding has been implemented natively in Apple Silicon’s MLX and rewritten with corrected benchmarking, yielding about 4.1x speedups on Qwen3.5-9B at 2048 tokens while maintaining high acceptance rates (~89%).
  • The approach parallelizes generation by having a draft model emit 16 tokens via block diffusion, then verifies them in a single forward pass with the target model before committing each token to keep results lossless.
  • Updates since the earlier draft include switching the baseline to stock `mlx_lm.stream_generate`, adding tape-replay rollback with a custom Metal kernel to avoid full state checkpointing, and improving verification performance with a JIT 2-pass SDPA kernel for long contexts (N≥1024).
  • Additional engineering work—such as numerically stable bf16 paths across speculative cycles—improved acceptance from ~82% to ~89% during long generations.
  • The full benchmark results for multiple token lengths (1024/2048/4096) are published alongside open-source code, using MLX 0.31.1 on an M5 Max with 64GB unified memory.

A few weeks ago I posted early results from a native MLX implementation of DFlash. Since then I rewrote the benchmark methodology, fixed numerical issues, and open sourced the whole thing.

A small draft model generates 16 tokens in parallel via block diffusion; the target model then verifies them in a single forward pass. Every emitted token is checked against the target before being committed, so the output is lossless. Stock MLX, no fork.
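The draft-then-verify cycle above can be sketched as follows. This is a toy greedy variant of the lossless scheme, not the dflash-mlx API: `draft_block` and `target_forward` are hypothetical stand-ins operating on integer token ids.

```python
# Minimal sketch of one speculative decoding cycle (greedy, lossless).
# `draft_block` and `target_forward` are hypothetical stand-ins, NOT the
# dflash-mlx API; here they are toy callables over integer token ids.

def speculative_step(prefix, draft_block, target_forward, block_size=16):
    # Draft proposes a whole block of tokens at once (block diffusion in DFlash).
    proposed = draft_block(prefix, block_size)
    # One target forward pass scores prefix + proposed block and yields the
    # target's own greedy choice at every position of the block.
    target_choices = target_forward(prefix, proposed)
    accepted = []
    for drafted, wanted in zip(proposed, target_choices):
        if drafted != wanted:
            # First mismatch: commit the target's token instead and stop,
            # so every emitted token is exactly what the target would emit.
            accepted.append(wanted)
            break
        accepted.append(drafted)
    return prefix + accepted, len(accepted)

# Toy demo: draft agrees with the target on the first 3 of 4 tokens.
draft = lambda p, k: [1, 2, 3, 9][:k]
target = lambda p, blk: [1, 2, 3, 4][:len(blk)]
out, n_accepted = speculative_step([0], draft, target, block_size=4)
```

The speedup comes from the block: one target forward pass can commit up to `block_size` tokens instead of one.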

Setup: M5 Max, 64GB, MLX 0.31.1. Baseline is stock mlx_lm.stream_generate, not a custom loop. 3 runs, median reported, 10s cooldown.
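The measurement protocol (3 runs, median, 10 s cooldown) amounts to a small harness like the sketch below; `run_generation` is a hypothetical stand-in for a timed call into `mlx_lm.stream_generate` or the DFlash path.

```python
# Sketch of the benchmark protocol described above: 3 runs, report the
# median, 10 s cooldown between runs. `run_generation` is a hypothetical
# callable that performs one timed generation and returns tokens/sec.

import statistics
import time

def bench(run_generation, runs=3, cooldown_s=10):
    toks_per_s = []
    for i in range(runs):
        toks_per_s.append(run_generation())
        if i < runs - 1:
            time.sleep(cooldown_s)  # let the SoC cool so thermals don't skew later runs
    return statistics.median(toks_per_s)
```

Median plus cooldown keeps a single thermally throttled run from dragging the reported number.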

Results @ 2048 tokens

| Model | Baseline | DFlash | Speedup | Acceptance |
|---|---|---|---|---|
| Qwen3.5-4B | 53.74 tok/s | 219.83 tok/s | 4.10x | 89.3% |
| Qwen3.5-9B | 30.96 tok/s | 127.07 tok/s | 4.13x | 89.4% |
| Qwen3.5-27B-4bit | 32.35 tok/s | 62.78 tok/s | 1.90x | 89.1% |
| Qwen3.5-35B-A3B-4bit | 142.12 tok/s | 240.21 tok/s | 1.69x | 88.7% |

Full results at 1024/2048/4096 in the repo.

What changed since last post

  • Baseline is now stock mlx_lm (was a custom Python loop that was slower, inflating the speedup)
  • Tape-replay rollback: custom Metal kernel that replays only accepted steps through GatedDeltaNet recurrent state. No full checkpoint save/restore. This is what keeps acceptance at 89% over long generations.
  • JIT 2-pass SDPA kernel for long-context verify (N >= 1024)
  • Numerically stable bf16 paths across speculative cycles
  • Acceptance went from ~82% to ~89% thanks to precision fixes
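The tape-replay idea can be shown with a toy recurrence. The post's version is a custom Metal kernel operating on GatedDeltaNet state on-GPU; this scalar sketch, with a hypothetical `step` function, only illustrates the principle: the committed state is never mutated during speculation, so after verification you simply replay the accepted steps through the recurrence instead of saving and restoring a full state checkpoint.

```python
# Toy sketch of tape-replay rollback for a recurrent state. `step` is a
# hypothetical recurrence update (in DFlash-MLX it is the GatedDeltaNet
# state update, run by a custom Metal kernel). The "tape" is just the
# drafted inputs, which are cheap to keep around.

def verify_cycle(state, step, drafted_inputs, n_accepted):
    # Speculative evaluation happens on a scratch copy elsewhere; the
    # committed state only ever advances by replaying accepted steps,
    # so no full checkpoint save/restore of the state is needed.
    for x in drafted_inputs[:n_accepted]:
        state = step(state, x)
    return state

# Demo with a scalar state and an additive recurrence.
step = lambda s, x: s + x
state = verify_cycle(0, step, drafted_inputs=[1, 2, 3, 4], n_accepted=2)
```

Replaying only accepted steps keeps the recurrent state bit-identical to a non-speculative run, which is what sustains acceptance over long generations.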

What I learned

On unified memory everything is bandwidth-bound. Custom Metal kernels (batched-GEMV, fused gated SiLU, custom SDPA) all came back slower than stock MLX. The wins came from numerical precision, not compute optimization.

The 27B-4bit speedup is lower because the quantized target is already fast, which makes the bf16 draft model the bottleneck. This is a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.
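A back-of-envelope model makes the bottleneck visible. Per block, the cost is roughly one draft block plus one target verify pass, while the win is the accepted tokens that would each have cost a full target pass at baseline. All timings below are illustrative, not measured numbers from the post.

```python
# Rough cost model for speculative decoding throughput. All times are in
# arbitrary units per block and are illustrative assumptions, not measurements.

def est_speedup(t_target_token, t_draft_block, t_verify_pass, accepted):
    # Baseline cost of the accepted tokens vs. the speculative block cost.
    return (accepted * t_target_token) / (t_draft_block + t_verify_pass)

# Slow (unquantized) target: the fixed draft cost is comparatively cheap.
slow_target = est_speedup(t_target_token=32.0, t_draft_block=20.0,
                          t_verify_pass=35.0, accepted=14)
# Fast (quantized) target: the same bf16 draft cost now dominates the cycle,
# so the speedup shrinks even at identical acceptance.
fast_target = est_speedup(t_target_token=12.0, t_draft_block=20.0,
                          t_verify_pass=13.0, accepted=14)
```

With acceptance held constant, shrinking only the target's per-token time shrinks the numerator while the draft term in the denominator stays fixed, which is exactly the 27B-4bit pattern in the table.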

Built specifically for Qwen3.5's hybrid GatedDeltaNet + attention architecture. Pure attention models (Qwen3, Gemma) work but without the tape-replay benefits.

Roadmap

  • Sustained acceptance at 4096+ tokens
  • Full-attention model optimization
  • Draft model compression

https://github.com/bstnxbt/dflash-mlx

submitted by /u/No_Shift_4543