A few weeks ago I posted early results from a native MLX implementation of DFlash. Since then I rewrote the benchmark methodology, fixed numerical issues, and open-sourced the whole thing. A small draft model generates 16 tokens in parallel via block diffusion; the target verifies them in one forward pass. Every emitted token is verified against the target model before being committed, so decoding is lossless. Stock MLX, no fork.

Setup: M5 Max, 64 GB unified memory, MLX 0.31.1. Baseline is stock `mlx_lm.stream_generate`.

Results @ 2048 tokens: about 4.1x over the baseline on Qwen3.5-9B, with roughly 89% acceptance.
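The draft-then-verify loop can be sketched in a few lines of Python. This is a toy greedy-decoding model, not DFlash's block-diffusion drafter: `draft_model` and `target_model` are hypothetical callables that return one next token, and the single verification forward pass is modeled position by position. The accept/commit rule is the standard lossless one: keep the longest prefix of draft tokens the target itself would have produced, then append the target's own next token.

```python
def speculative_step(draft_model, target_model, prefix, block_size=16):
    """One draft-verify-commit cycle (greedy decoding sketch).

    draft_model / target_model: hypothetical callables mapping a token
    list to the next greedy token (stand-ins for the real models).
    Returns the tokens committed this cycle; losslessness holds because
    every committed token matches what the target alone would emit.
    """
    # 1. Draft proposes a block of tokens. DFlash does this in parallel
    #    via block diffusion; here we fake it token by token.
    draft = []
    ctx = list(prefix)
    for _ in range(block_size):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Target verifies the whole block (one forward pass in practice;
    #    modeled here as the target's greedy choice at each position).
    committed = []
    ctx = list(prefix)
    for t in draft:
        expected = target_model(ctx)
        if t != expected:
            committed.append(expected)   # replace the first mismatch
            return committed
        committed.append(t)              # accept matching draft token
        ctx.append(t)

    committed.append(target_model(ctx))  # bonus token after a full accept
    return committed
```

On a fully accepted block this commits `block_size + 1` tokens for a single target pass; on an early mismatch it still commits at least one correct token, so progress never stalls.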
Full results at 1024/2048/4096 are in the repo.

What changed since last post

- Baseline switched to stock `mlx_lm.stream_generate`.
- Tape-replay rollback with a custom Metal kernel, replacing full state checkpointing.
- A JIT 2-pass SDPA kernel for long-context verification (N ≥ 1024).
- Numerically stable bf16 paths across speculative cycles, lifting acceptance from ~82% to ~89% on long generations.
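One of those changes, tape-replay rollback, can be illustrated with a toy recurrent state. This is a sketch, not the repo's Metal-kernel implementation: `step_fn` is a hypothetical pure state update, and rollback works by replaying the accepted slice of the input tape from the last committed state instead of checkpointing a full state copy for every drafted token.

```python
class TapeReplayState:
    """Rollback for a recurrent layer without full state checkpointing.

    Toy version of the tape-replay idea: keep a tape of inputs applied
    since the last commit, and rebuild the state by replaying only the
    accepted prefix. `step_fn` is a hypothetical pure update
    (state, token) -> new_state.
    """

    def __init__(self, step_fn, init_state):
        self.step_fn = step_fn
        self.base_state = init_state   # state at the last committed token
        self.state = init_state        # current (speculative) state
        self.tape = []                 # tokens applied since base_state

    def step(self, token):
        # Advance speculatively, recording the input on the tape.
        self.state = self.step_fn(self.state, token)
        self.tape.append(token)

    def commit(self, n_accepted):
        # Roll back to base_state, then replay only the accepted tokens.
        s = self.base_state
        for tok in self.tape[:n_accepted]:
            s = self.step_fn(s, tok)
        self.base_state = s
        self.state = s
        self.tape = []
```

Replaying k accepted tokens costs k small updates, which is typically cheaper than storing and restoring a full copy of a large recurrent state for every speculative block.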
What I learned

On unified memory, everything is bandwidth-bound. Custom Metal kernels (batched GEMV, fused gated SiLU, custom SDPA) all came back slower than stock MLX; the wins came from numerical precision, not compute optimization. The 27B-4bit speedup is lower because the quantized target is already fast, making the bf16 draft the bottleneck. That is a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets. The implementation is built specifically for Qwen3.5's hybrid GatedDeltaNet + attention architecture; pure-attention models (Qwen3, Gemma) work, but without the tape-replay benefits.

Roadmap
DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)
Reddit r/LocalLLaMA / 4/14/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- DFlash speculative decoding has been implemented natively in MLX on Apple Silicon and re-benchmarked with corrected methodology, yielding about a 4.1x speedup on Qwen3.5-9B at 2048 tokens while maintaining a high acceptance rate (~89%).
- The approach parallelizes generation by having a draft model emit 16 tokens via block diffusion, then verifies them in a single forward pass with the target model before committing each token to keep results lossless.
- Updates since the earlier draft include switching the baseline to stock `mlx_lm.stream_generate`, adding tape-replay rollback with a custom Metal kernel to avoid full state checkpointing, and improving verification performance with a JIT 2-pass SDPA kernel for long contexts (N≥1024).
- Additional engineering work, such as numerically stable bf16 paths across speculative cycles, improved acceptance from ~82% to ~89% during long generations.
- The full benchmark results for multiple token lengths (1024/2048/4096) are published alongside open-source code, using MLX 0.31.1 on an M5 Max with 64GB unified memory.
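As a rough sanity check on the headline numbers, the classic i.i.d. per-token acceptance model from the speculative-decoding literature gives an upper bound on tokens committed per target pass. This is an illustrative assumption, not a claim about how DFlash's block-diffusion acceptance actually behaves:

```python
def expected_tokens_per_cycle(accept_rate, block_size):
    """Expected committed tokens per target forward pass under the
    classic i.i.d. per-token acceptance model (accept a run of matching
    draft tokens, then one token from the target itself):

        E = (1 - a**(k+1)) / (1 - a)

    for per-token acceptance probability a < 1 and draft block size k.
    """
    a, k = accept_rate, block_size
    return (1 - a ** (k + 1)) / (1 - a)
```

With a ≈ 0.89 and a block of 16 this bounds the gain at roughly 7.8 tokens per target pass; the measured ~4.1x is lower because draft generation and verification are themselves not free on bandwidth-bound hardware.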