DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

Reddit r/LocalLLaMA / 4/12/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsTools & Practical UsageModels & Research

共有:

Key Points

投稿者はMLXでApple Silicon（M5 Max, 64GB）向けにDFlashのネイティブ実装を進め、ドラフト生成とターゲット検証を効率化する方式を示しています。
ドラフトモデルが16トークンを並列生成し、ターゲット側が1回のフォワードパスでそれらを検証することで、ベースライン（greedy exact argmax）とビット単位で一致する出力を実現したと述べています。
Qwen3.5-9B（bf16）では1024/2048トークン生成でそれぞれ約3.3x/3.1xの速度向上（26 tok/s→85 tok/s など）を報告しています。
Qwen3.5-4B（bf16）でも2.7x〜3.2xの改善が見られ、4Bは長いコンテキストでむしろ速度が伸びる傾向を示しました。
Qwen3.5-27Bを量子化（8bit/4bit）してもDFlashはベースラインより高速である一方、量子化度が下がるほど（4bit）スピードアップは小さくなる例が示されています。

DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

I'm building a native MLX implementation of DFlash (paper) for Apple Silicon. A small draft model generates 16 tokens in parallel via block diffusion, the target verifies them in one forward pass. Output is bit-for-bit identical to baseline (greedy exact argmax match).

Setup: M5 Max, 64GB, MLX, no CUDA.

Results

Qwen3.5-9B bf16

Gen length	DFlash	Baseline	Speedup
1024 tokens	85 tok/s	26 tok/s	3.3x
2048 tokens	80 tok/s	26 tok/s	3.1x

Qwen3.5-4B bf16

Gen length	DFlash	Baseline	Speedup
1024 tokens	109 tok/s	41 tok/s	2.7x
2048 tokens	133 tok/s	42 tok/s	3.2x

The 4B actually gets faster at longer generation. The model is small enough that the draft/verify balance stays healthy as context grows.

Qwen3.5-27B quantized

Quant	Gen length	DFlash	Baseline	Speedup
8bit	1024 tokens	35 tok/s	14 tok/s	2.5x
8bit	2048 tokens	26 tok/s	11 tok/s	2.3x
4bit	1024 tokens	44 tok/s	24 tok/s	1.9x
4bit	2048 tokens	40 tok/s	23 tok/s	1.7x

8bit gives better speedup ratios than 4bit. int4 makes the verify so fast that the bf16 draft becomes the bottleneck. With int8, the draft/verify balance is healthier.

All numbers are generation only (first token to last token, no prefill). Acceptance around 80-87% across all models.

What I built

No DFlash MLX implementation existed. I wrote the runtime from scratch. What actually moved the numbers:

head_dim=256 patch. Qwen3.5-9B uses head_dim=256, which MLX's steel_attention didn't support. A 2-line patch unlocked the fast SDPA path.

Sync elision. Restructured the pipeline from 2 GPU→CPU syncs per cycle to 1. At 80+ tok/s each sync costs ~0.5ms.

Packed QKV projection. 3 matmuls → 1 matmul + split. Fewer kernel dispatches per layer.

Lessons on Apple Silicon

On unified memory everything is bandwidth-bound, which changes the speculative decoding game:

Custom Metal kernels (batched-GEMV, fused gated SiLU, custom SDPA) all came back 0.5 to 0.8x slower than stock MLX steel GEMM. Ended up reverting all of them.

Verify cost is almost flat from 4 to 16 tokens (57ms vs 59ms). Weight loading dominates, not token count. "Verify fewer tokens when confidence is low" doesn't help here.

On quantized models, the optimization landscape flips: the draft (bf16) becomes slower than the verify (int4/int8). This is the opposite of the bf16 case and is a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.

Currently working on

Draft compression/distillation for the 27B to fix the bf16 draft bottleneck on quantized targets.

Long context stability. Speedup degrades past 2K tokens due to KV cache growth.

MoE models. DFlash drafts exist for Qwen3.5-35B-A3B (35B total, 3B active). Verify cost of a small model, quality of a large one.

Everything is still very much under construction. Will open source when ready.

submitted by /u/No_Shift_4543
[link] [comments]

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 4/12DailyView insight →

Black Hat USA

AI Business

Black Hat Asia

AI Business

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

Best AI Video Generator in 2026: Top Tools Tested & Compared

Dev.to

The Future of Agent Integration: A2A vs ANP and the Three-Layer Security Architecture

Dev.to

DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

Key Points

Results

What I built

Lessons on Apple Silicon

Currently working on

💡 Insights using this article

Related Articles

Black Hat USA

Black Hat Asia

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Best AI Video Generator in 2026: Top Tools Tested & Compared

The Future of Agent Integration: A2A vs ANP and the Three-Layer Security Architecture

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer