DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)

Reddit r/LocalLLaMA / 2026/4/14


Key points

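DFlash speculative decoding now has a native MLX implementation for Apple Silicon, and its benchmark methodology has been rewritten. With the corrected methodology it reaches about a 4.1x speedup on Qwen3.5-9B at 2048 tokens while holding an acceptance rate of about 89%.
The method parallelizes generation: a draft model produces 16 tokens via block diffusion, and the target model then verifies them in a single forward pass. Because every token is verified before it is committed, the output is lossless.
Changes since the earlier draft: the baseline is now the stock `mlx_lm.stream_generate`; rollback uses tape replay through a custom Metal kernel, which avoids checkpointing the full state; and a JIT-compiled 2-pass SDPA kernel speeds up verification for long contexts (N ≥ 1024).
Numerically stable bf16 paths across speculative cycles raised the acceptance rate over long generations from about 82% to about 89%.
Full benchmark results at 1024, 2048, and 4096 tokens are published with the open-source code, measured on an M5 Max (64GB unified memory) with MLX 0.31.1.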

DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)

A few weeks ago I posted early results from a native MLX implementation of DFlash. Since then I rewrote the benchmark methodology, fixed numerical issues, and open sourced the whole thing.

A small draft model generates 16 tokens in parallel via block diffusion, and the target verifies them in one forward pass. Every emitted token is verified against the target model before being committed, so the output is lossless. Runs on stock MLX, no fork.
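To make the draft/verify cycle concrete, here is a minimal sketch of one way such a loop can work. It assumes greedy decoding, and it assumes the verify pass also yields one extra target token when the whole block is accepted; the post states neither. `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and the per-token calls to `target` only mimic what a single batched forward pass would return.

```python
from typing import Callable, List

Token = int
DraftFn = Callable[[List[Token], int], List[Token]]  # (prefix, block size) -> proposed block
TargetFn = Callable[[List[Token]], Token]            # prefix -> greedy next token


def speculative_step(prefix: List[Token], draft: DraftFn, target: TargetFn,
                     block_size: int = 16) -> List[Token]:
    """Run one draft/verify cycle and return the tokens to commit."""
    proposed = draft(prefix, block_size)
    committed: List[Token] = []
    for tok in proposed:
        # A real implementation obtains all of these predictions from one
        # batched target forward pass; this loop only mimics the outcome.
        expected = target(prefix + committed)
        if tok != expected:
            committed.append(expected)  # first mismatch: keep the target's token, stop
            return committed
        committed.append(tok)
    # Whole block accepted: assume the same pass also gives one more target token.
    committed.append(target(prefix + committed))
    return committed


def generate(prompt: List[Token], draft: DraftFn, target: TargetFn,
             max_new_tokens: int, block_size: int = 16) -> List[Token]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, block_size))
    return out[: len(prompt) + max_new_tokens]


# Hypothetical toy models, used only to exercise the loop.
def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] * 7 + len(prefix)) % 101


def toy_draft(prefix: List[Token], n: int) -> List[Token]:
    block, ctx = [], list(prefix)
    for _ in range(n):
        tok = toy_target(ctx)
        if len(ctx) % 5 == 0:
            tok = (tok + 1) % 101  # deliberate draft error
        block.append(tok)
        ctx.append(tok)
    return block
```

Because every committed token is the one the target would have picked for that prefix, the output matches plain greedy decoding with the target alone, which is the sense in which the scheme is lossless.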

Setup: M5 Max, 64GB, MLX 0.31.1. Baseline is stock mlx_lm.stream_generate, not a custom loop. 3 runs, median reported, 10s cooldown.
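For readers who want to reproduce the measurement protocol (3 runs, median, cooldown between runs), a minimal harness could look like the sketch below. It is not the repo's benchmark script. `run_once` is a hypothetical callable that performs one full generation, for example by consuming `mlx_lm.stream_generate`, and returns the number of tokens produced.

```python
import statistics
import time
from typing import Callable, List


def median_tokens_per_second(run_once: Callable[[], int],
                             runs: int = 3,
                             cooldown_s: float = 10.0) -> float:
    """Time `run_once` several times and return the median throughput in tok/s."""
    rates: List[float] = []
    for i in range(runs):
        if i > 0:
            time.sleep(cooldown_s)  # pause between runs so thermals can settle
        start = time.perf_counter()
        n_tokens = run_once()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)
```

The median is less sensitive than the mean to a single slow run, which matters with only three samples.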

Results @ 2048 tokens

Model	Baseline	DFlash	Speedup	Acceptance
Qwen3.5-4B	53.74 tok/s	219.83 tok/s	4.10x	89.3%
Qwen3.5-9B	30.96 tok/s	127.07 tok/s	4.13x	89.4%
Qwen3.5-27B-4bit	32.35 tok/s	62.78 tok/s	1.90x	89.1%
Qwen3.5-35B-A3B-4bit	142.12 tok/s	240.21 tok/s	1.69x	88.7%

Full results at 1024/2048/4096 in the repo.
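As a quick sanity check on the table, the snippet below divides the DFlash median by the baseline median for each row and prints it next to the reported Speedup. The ratios land within 0.05 of the reported column but do not match it exactly for every row. One possible explanation, which is a guess and not something the post states, is that the speedup is computed per run and then aggregated.

```python
# (baseline tok/s, DFlash tok/s, reported speedup), copied from the table above.
rows = {
    "Qwen3.5-4B": (53.74, 219.83, 4.10),
    "Qwen3.5-9B": (30.96, 127.07, 4.13),
    "Qwen3.5-27B-4bit": (32.35, 62.78, 1.90),
    "Qwen3.5-35B-A3B-4bit": (142.12, 240.21, 1.69),
}


def throughput_ratio(baseline: float, dflash: float) -> float:
    """Ratio of the two median throughputs."""
    return dflash / baseline


for name, (base, fast, reported) in rows.items():
    ratio = throughput_ratio(base, fast)
    print(f"{name}: ratio of medians {ratio:.2f}x, reported {reported:.2f}x")
```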

What changed since last post

Baseline is now stock mlx_lm (was a custom Python loop that was slower, inflating the speedup)
Tape-replay rollback: custom Metal kernel that replays only accepted steps through GatedDeltaNet recurrent state. No full checkpoint save/restore. This is what keeps acceptance at 89% over long generations.
JIT 2-pass SDPA kernel for long-context verify (N >= 1024)
Numerically stable bf16 paths across speculative cycles
Acceptance went from ~82% to ~89% thanks to precision fixes
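The post does not describe the internals of the 2-pass SDPA kernel. A common way to structure such a kernel, assumed here, is to split the keys into chunks, compute a partial softmax per chunk in the first pass, and merge the partials in the second pass using a shared maximum. The pure-Python sketch below shows that math for a single query vector; all function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Partial = Tuple[float, float, Vector]  # (max score, sum of exps, unnormalised output)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def partial_attention(q: Vector, keys: List[Vector], values: List[Vector],
                      scale: float) -> Partial:
    """Pass 1: attention over one chunk of keys, normalised only by the chunk max."""
    scores = [scale * dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, sum(weights), out


def combine(partials: List[Partial]) -> Vector:
    """Pass 2: rescale each chunk to the global max and normalise once."""
    g = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    denom = 0.0
    acc = [0.0] * dim
    for m, s, out in partials:
        c = math.exp(m - g)
        denom += c * s
        for d in range(dim):
            acc[d] += c * out[d]
    return [a / denom for a in acc]


def two_pass_sdpa(q: Vector, keys: List[Vector], values: List[Vector],
                  chunk: int = 256) -> Vector:
    scale = 1.0 / math.sqrt(len(q))
    partials = [partial_attention(q, keys[i:i + chunk], values[i:i + chunk], scale)
                for i in range(0, len(keys), chunk)]
    return combine(partials)


def reference_sdpa(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-pass softmax attention, used only to check the result."""
    _, s, out = partial_attention(q, keys, values, 1.0 / math.sqrt(len(q)))
    return [o / s for o in out]
```

Splitting along the key dimension lets the chunks be processed independently, which is why this layout tends to pay off only once the context is long enough.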

What I learned

On unified memory everything is bandwidth-bound. Custom Metal kernels (batched-GEMV, fused gated SiLU, custom SDPA) all came back slower than stock MLX. The wins came from numerical precision, not compute optimization.
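The post credits numerical precision for the gains but does not say which operations were changed. As a generic illustration of why bf16 needs care, the sketch below emulates bf16 rounding in pure Python and compares two ways of summing the same values: rounding to bf16 after every addition versus accumulating in higher precision and rounding once. `to_bf16` is a hand-written emulation for finite values, not an MLX call.

```python
import struct
from typing import Iterable


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of the float32 bit pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def accumulate_bf16(values: Iterable[float]) -> float:
    """Keep the running sum in bf16: every addition is rounded."""
    total = 0.0
    for v in values:
        total = to_bf16(total + to_bf16(v))
    return total


def accumulate_wide(values: Iterable[float]) -> float:
    """Accumulate in higher precision and round to bf16 once at the end."""
    total = 0.0
    for v in values:
        total += to_bf16(v)
    return to_bf16(total)


values = [0.01] * 1000  # exact sum is 10
print("bf16 accumulator:", accumulate_bf16(values))
print("wide accumulator:", accumulate_wide(values))
```

With only 8 bits of significand, a bf16 running sum stops growing once each addend is smaller than half the spacing between representable values, so small errors of this kind can compound across many speculative cycles.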

The 27B-4bit speedup is lower because the quantized target is already fast, making the bf16 draft the bottleneck. This is a structural limitation of speculative decoding on bandwidth-bound hardware with quantized targets.
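A rough cost model shows why a faster target shrinks the gain. Measure every cost in units of one ordinary target decode step, so plain decoding produces one token per unit of time. The function and the numbers below are hypothetical and chosen only to show the trend; they are not measurements from the repo.

```python
def speculative_speedup(tokens_per_cycle: float,
                        draft_cost: float,
                        verify_cost: float) -> float:
    """Estimated speedup over plain decoding.

    tokens_per_cycle: average tokens committed per draft/verify cycle.
    draft_cost, verify_cost: time per cycle, in units of one target decode step.
    """
    return tokens_per_cycle / (draft_cost + verify_cost)


# Hypothetical values. Quantizing the target makes one target step cheaper,
# so the same bf16 draft costs more when measured in target steps.
cheap_draft = speculative_speedup(tokens_per_cycle=8.0, draft_cost=0.5, verify_cost=1.5)
costly_draft = speculative_speedup(tokens_per_cycle=8.0, draft_cost=2.5, verify_cost=1.5)
print(f"draft cheap relative to target:  {cheap_draft:.2f}x")
print(f"draft costly relative to target: {costly_draft:.2f}x")
```

Under this model the acceptance rate can stay the same while the speedup drops, which is consistent with the table: acceptance is roughly 89% in every row, but the quantized targets gain less.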

Built specifically for Qwen3.5's hybrid GatedDeltaNet + attention architecture. Pure attention models (Qwen3, Gemma) work but without the tape-replay benefits.
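The reason hybrid models need special rollback handling is that an attention KV cache can simply be truncated to the accepted length, while a recurrent state has already absorbed the rejected tokens. The sketch below illustrates the tape-replay idea in pure Python. It assumes a simplified gated delta-rule update, and the real layer and Metal kernel may differ. The verify pass records the small per-token inputs (the "tape"), and after acceptance the committed state is rebuilt from the pre-verify state and only the accepted entries.

```python
from typing import List, NamedTuple

Matrix = List[List[float]]


class TapeEntry(NamedTuple):
    k: List[float]  # key vector, length d_k
    v: List[float]  # value vector, length d_v
    alpha: float    # decay gate
    beta: float     # write strength


def delta_step(S: Matrix, e: TapeEntry) -> Matrix:
    """One simplified gated delta-rule update:
    S <- alpha * (S - beta * (S k) k^T) + beta * v k^T, with S of shape d_v x d_k."""
    Sk = [sum(row[j] * e.k[j] for j in range(len(e.k))) for row in S]
    return [
        [e.alpha * (S[i][j] - e.beta * Sk[i] * e.k[j]) + e.beta * e.v[i] * e.k[j]
         for j in range(len(e.k))]
        for i in range(len(e.v))
    ]


def replay(base: Matrix, tape: List[TapeEntry], n_accepted: int) -> Matrix:
    """Rebuild the committed state from the pre-verify state and the accepted tape prefix."""
    S = [row[:] for row in base]
    for e in tape[:n_accepted]:
        S = delta_step(S, e)
    return S
```

Each tape entry holds d_k + d_v + 2 numbers, whereas a per-token snapshot of the state holds d_k * d_v, so keeping a tape for a 16-token block is much cheaper than saving 16 full states.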

Roadmap

Sustained acceptance at 4096+ tokens
Full-attention model optimization
Draft model compression

https://github.com/bstnxbt/dflash-mlx

submitted by /u/No_Shift_4543
