DTree on MLX ... tiny win over DFlash on Qwen3.5-4B (M2)

Reddit r/LocalLLaMA / 4/16/2026


Key Points

  • A developer ported DTree to MLX and reports a small but repeatable speed improvement on an M2 Max 32GB setup using Qwen3.5-4B (q4_g64), where DTree reaches 48.31 e2e tok/s versus DFlash’s 45.07 e2e tok/s (about 1.07x).
  • The author notes that many other experimental configurations tried on MLX were flat or worse, suggesting the current improvement is narrow but real enough to share.
  • They conclude that verifier-side cost in MLX remains the primary bottleneck limiting larger DTree gains.
  • The post links to the project repository (dtree-mlx) and asks the community whether anyone has achieved bigger DTree performance improvements on MLX.

I ported DTree to MLX ... and finally got one setting that seems to beat matched DFlash locally.

M2 Max 32GB, Qwen3.5-4B, q4_g64, spec=16, tree_budget=24

- DFlash: 45.07 e2e tok/s
- DTree: 48.31 e2e tok/s

So basically ~1.07x over DFlash. Not massive, but at least it looks real and repeatable enough to mention.
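For anyone who wants to check the ratio, here is a minimal sketch that recomputes it from the two throughput figures quoted above. The variable names are only illustrative and are not taken from the dtree-mlx repo.

```python
# End-to-end throughput figures quoted in the post (tokens per second).
dflash_tok_s = 45.07
dtree_tok_s = 48.31

# Relative throughput of DTree over DFlash, and the same gain as a percentage.
speedup = dtree_tok_s / dflash_tok_s
gain_pct = (speedup - 1.0) * 100.0

print(f"speedup: {speedup:.3f}x, gain: {gain_pct:.1f}%")
```

The result rounds to the ~1.07x reported in the post, or roughly a 7% gain.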

A lot of the other things I tried were flat or just worse, so my current read is that MLX verifier cost is still the main limiter here.
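One way to see why verifier cost can cap the gain is a toy per-cycle model of speculative decoding: throughput is the number of tokens accepted per draft-and-verify cycle divided by the time the cycle takes. The sketch below uses made-up numbers purely for illustration. They are assumptions, not measurements from this post or from dtree-mlx, and the function name is hypothetical.

```python
def tokens_per_second(accepted_per_cycle, draft_ms, verify_ms):
    """Throughput of one draft-then-verify cycle, in tokens per second."""
    cycle_seconds = (draft_ms + verify_ms) / 1000.0
    return accepted_per_cycle / cycle_seconds

# Hypothetical numbers, chosen only to illustrate the trade-off.
# Chain-style drafting: fewer accepted tokens, cheaper verification pass.
chain = tokens_per_second(accepted_per_cycle=4.0, draft_ms=10.0, verify_ms=70.0)

# Tree-style drafting: more accepted tokens, but the verifier scores a larger
# set of candidates, so the verification pass takes longer.
tree = tokens_per_second(accepted_per_cycle=4.6, draft_ms=10.0, verify_ms=82.0)

print(f"chain: {chain:.1f} tok/s, tree: {tree:.1f} tok/s")
```

With these made-up values, a 15% increase in accepted tokens is cancelled out by a 12 ms longer verification pass, which is the kind of effect that would leave most tree configurations flat or worse.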

Has anyone gotten bigger DTree gains on MLX?

https://github.com/DrHB/dtree-mlx

submitted by /u/naftalinus