TLDR: 28 tok/s → 63 tok/s on Qwen3.6-27B on a MacBook Pro M5 Max. 2.24× faster at real temperature 0.6. Works for coding, creative writing, and chat.
What Is MTPLX?

MTPLX uses a model's built-in MTP heads as speculative drafters to increase decode speeds on LLMs by up to 2.25x, all while preserving the model's default inference settings, so you can use it for coding or creative writing tasks.

Qwen 3.6 27B @ 63 TPS on a MacBook Pro M5 Max

Using MTPLX I increased decode speed on Qwen 3.6 27B 4-bit MLX from 28 tok/s to 63 tok/s on a MacBook Pro M5 Max at temperature 0.6 with top_p 0.95 and top_k 20, the exact sampling settings Qwen recommends for coding.

Qwen 3.6 27B ships with built-in MTP heads that support up to depth 5. I ran a sweep across D2, D3, D4, and D5 to find the optimal depth for this model on this hardware. D3 was the sweet spot: its acceptance rate stayed high enough relative to verify time that TPS increased the most. D4 and D5 have good acceptance at the early positions, but the deeper positions start costing more in verify time than they save in accepted tokens.

These results are at real temperature 0.6 with exact probability-ratio rejection sampling and residual correction. That means you can actually use Qwen 3.6 27B for real coding work with a 2.25x speed increase without sacrificing output quality.

How Is This Different From DFlash / DDTree?

DFlash MLX has greater absolute speed, but it is restricted to greedy (temp 0) sampling, which severely limits its real-world use. It also requires an external drafter model, which takes additional memory and has to be created for every model that is released. DDTree adds tree-based verification on top of DFlash, so it inherits the same limitations: greedy only, external drafter required.

The reason comes down to how each system drafts. MTP heads draft sequentially: each token sees the previous draft tokens, so every position produces a real probability distribution. DFlash drafts all 16 tokens simultaneously in a parallel diffusion pass, so token 8 does not know what token 7 is.
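A toy sketch of the sequential case (not MTPLX code): each draft step conditions on the tokens drafted so far, so every position yields an explicit distribution q_i that exact rejection sampling can later compare against the target's p_i. `draft_step` here is a hypothetical stand-in for one MTP head forward pass.

```python
import numpy as np

VOCAB = 8
rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def draft_step(prefix):
    # Stand-in for an MTP head: logits depend on the full prefix,
    # including the earlier draft tokens.
    logits = rng.normal(size=VOCAB) + 0.01 * sum(prefix)
    return softmax(logits)

prefix, qs, draft = [1, 2], [], []
for _ in range(3):                     # draft depth 3
    q = draft_step(prefix + draft)     # token i sees draft tokens 0..i-1
    t = int(rng.choice(VOCAB, p=q))
    qs.append(q)                       # keep q_i for the accept test later
    draft.append(t)
```

A parallel diffusion drafter emits all K tokens in one pass, so there is no q_i conditioned on the earlier draft tokens to plug into an acceptance ratio.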
Without that sequential dependency there is no per-token probability distribution, which means you cannot do the rejection-sampling maths that makes temperature work. MTPLX works with any model that retains its MTP heads, and gives the user full control to choose the number of MTP heads and to run any locally saved or HuggingFace model that has them.

Architecture

Layer 0: MLX runtime

MTPLX runs on a patched MLX fork. Stock MLX's quantised matrix-vector kernel is tuned for large M (prefill). During MTP verify, M is 3 to 6: one position per draft token. Stock MLX stalls at these shapes. The patch: wider simdgroups and loop unrolling, 10 lines of Metal. Exact, 0.0 diff against stock. On top of the fork sit four custom Metal kernels registered as MLX primitives.
Layer 1: Single-model runtime

One checkpoint. The target model and the drafter are the same model: Qwen3.6-27B ships native MTP heads and MTPLX uses them. Zero RAM for a second model. The trunk's KV cache uses a committed-history contract, verified against the vLLM CUDA reference at cosine > 0.9998 through depth 5.

Layer 2: Speculative cycle (the hot loop)

Per cycle: the MTP head drafts K tokens, each seeing the previous draft. The target verifies all K in one batched forward via a compiled GraphBank path. Probability-ratio acceptance (Leviathan-Chen) decides per position in fp32. Residual correction (p - q)+ emits a clean replacement on rejection. A bonus token falls out free when all K accept. The innovation tape commits accepted GDN state deltas and rolls back rejected ones.

Layer 3: Serving stack

A real API server: OpenAI-compatible /v1/chat/completions and /v1/completions with streaming SSE, Anthropic-compatible /v1/messages, plus /v1/models, /health, and /metrics. Engine sessions with per-chat KV state. The Session Bank preserves warm-prefix exact state across turns, verified at logits max_abs_diff = 0.0 against fresh forwards. Browser chat UI at localhost with live tok/s, markdown rendering, code-block copy, and a stop button. Terminal chat via mtplx chat.

What I Had To Solve

Native MTP on Apple Silicon did not work by default. There were four stacked problems.

1) Recursive depth collapse

Running MTP recursively, accuracy collapses after depth 1: 91% → 63% → 44% → 27% → 17%. Everyone who tried native MTP saw this and gave up. I SSH'd into my 2x3090 PC running vLLM with MTP-5, traced the exact MTP execution, and compared it against MLX token by token. The finding: MLX was resetting the MTP attention KV cache every speculative cycle. vLLM does not; it persists MTP history across cycles. One contract fix, and depth-2 acceptance jumped from 49% to 74%.

2) Precision mismatch

Every project was using BF16 MTP heads on quantised 4-bit trunks.
The MTP head is more precise than the hidden states it receives, which amplifies quantisation noise through recursive prediction. I grafted calibrated INT4 MTP weights onto the trunk, matching MTP precision to trunk precision. Depth-3 acceptance jumped from 30% to 88%.

3) MLX verify bottleneck

Even with high acceptance, stock MLX's verify pass was so expensive that MTP was slower than plain autoregressive decode. MLP operations accounted for 51% of verify time. I patched MLX's Metal qmv shader for the small verify shapes MTP produces (10 lines, wider simdgroups plus loop unrolling), built an innovation-tape GDN capture system for efficient state rollback, batched target probability distributions into a single MLX eval boundary, and deferred MTP history materialisation. Four stacked optimisations cut verify cycle time from ~90 ms to ~47 ms per call, taking MTP from slower than plain autoregressive to 2.24× faster.

4) TPS decay

On long responses (8k+ tokens), throughput collapsed. I spent 16 hours trying to figure out why TPS would decay from 50 to 25, a 50% drop, investigating 24 different profiles: lazy-eval graph accumulation, cache growth, state provenance, paged attention, owned recurrent caches, two-pass Metal SDPA. None of them solved it. The problem was hilariously simple: the speculative decode loop sustains significantly heavier GPU load than normal autoregressive decode. Every cycle runs a full batched verify forward plus draft computation plus MTP history maintenance. That extra sustained workload was pushing the M5 Max SoC to 103°C, and macOS's default fan curve ramps far too late. By the time the fans respond, the GPU has already downclocked. I added a MAX mode to the CLI: using ThermalForge, fans are locked at full speed before generation starts, with a detached watchdog that restores them to auto if the process dies for any reason. TPS decay dropped from 50% to 6.7%, and GPU clock retention went from 85.6% to 97.1%.
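The innovation-tape rollback mentioned under problem 3 can be sketched roughly like this: record one recurrent-state delta per speculated token during verify, then commit only the deltas for accepted tokens and drop the rest. Class and method names here are illustrative, not MTPLX's actual API.

```python
import numpy as np

class InnovationTape:
    def __init__(self, state):
        self.state = state      # last committed recurrent (GDN-style) state
        self.deltas = []        # one state delta per speculated token

    def record(self, delta):
        self.deltas.append(delta)

    def commit(self, n_accepted):
        # Apply accepted deltas in order; rejected ones are simply dropped,
        # which rolls the state back to the last accepted position without
        # recomputing anything.
        for d in self.deltas[:n_accepted]:
            self.state = self.state + d
        self.deltas.clear()
        return self.state

tape = InnovationTape(np.zeros(4))
for step_delta in [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]:
    tape.record(step_delta)          # three draft tokens speculated
state = tape.commit(2)               # verify accepted 2 of the 3
```

The design point is that rollback is free: rejected deltas are never applied, so there is no inverse update to compute.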
16 hours of kernel debugging, solved by a fan controller.

Caveats
In the meantime you can run my official Qwen 3.6 27B MTPLX Optimised from . The CLI makes it easy to set up and download. If you publish MLX quants, please keep the MTP heads: they are around 200 MB on a 27B model, cost almost nothing in memory, and are now worth a 2.25× speedup. Really looking forward to everyone's thoughts and contributions to this project, making local LLMs on MLX faster and more viable for everyone.
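For readers curious about the exact-sampling claim above, the per-position accept/reject step (probability-ratio acceptance with a normalised (p - q)+ residual draw, in the style of speculative sampling per Leviathan et al.) can be sketched as follows. The distributions and token are toy values, not MTPLX internals.

```python
import numpy as np

def accept_or_correct(p, q, x, rng):
    """p: target distribution, q: drafter distribution, x: drafted token.
    Accept x with probability min(1, p[x]/q[x]); otherwise emit a token
    drawn from the normalised residual (p - q)+. Either way, the emitted
    token is distributed exactly according to p."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True                       # draft token accepted
    residual = np.maximum(p - q, 0.0)        # (p - q)+ correction
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])   # target (temperature-sampled) distribution
q = np.array([0.2, 0.5, 0.3])   # drafter distribution at the same position
token, accepted = accept_or_correct(p, q, int(rng.choice(3, p=q)), rng)
```

This identity, emitted tokens being exactly p-distributed, is why a sequential drafter can run at real temperature without quality loss.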
MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon
Reddit r/LocalLLaMA / 5/5/2026
Key Points
- MTPLX is a native MTP (speculative decoding) inference engine for Apple Silicon that can speed up LLM decoding by up to about 2.25x while keeping the model’s default inference behavior.
- The project claims broad compatibility, using each model’s built-in MTP heads without adding an external drafter or extra memory overhead.
- It differs from other speculative-decoding approaches on Apple Silicon by using mathematically exact temperature sampling with rejection sampling rather than “greedy-only” drafting, enabling configurable temperatures for different tasks.
- In benchmarks, MTPLX reportedly improved Qwen3.6-27B on a MacBook Pro M5 Max from 28 tok/s to 63 tok/s at temperature 0.6, with Qwen-recommended sampling settings for coding and an optimized MTP depth.
- MTPLX includes a full CLI and a complete serving stack compatible with OpenAI and Anthropic APIs, plus benchmarking and diagnostics features for practical deployment.
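The "optimized MTP depth" in the points above reflects a simple tradeoff. Under the simplifying assumption of a constant per-position acceptance rate and a per-cycle verify cost that grows with draft depth (illustrative numbers only; real acceptance decays with depth), it can be modelled roughly as:

```python
def expected_tokens_per_cycle(a: float, k: int) -> float:
    # Standard speculative-decoding expectation: each cycle emits the
    # accepted prefix plus one corrected/bonus token.
    # E = 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a)
    return (1.0 - a ** (k + 1)) / (1.0 - a)

def throughput_gain(a: float, k: int, cycle_cost: float) -> float:
    # cycle_cost: cost of one draft+verify cycle relative to one plain
    # autoregressive decode step (> 1 because of the batched verify).
    return expected_tokens_per_cycle(a, k) / cycle_cost

# With these toy costs, a middle depth wins even though deeper drafts
# accept more tokens in total:
gains = {k: throughput_gain(0.8, k, cost)
         for k, cost in [(2, 1.3), (3, 1.45), (4, 1.7), (5, 2.0)]}
```

Under these toy numbers `max(gains, key=gains.get)` picks depth 3, mirroring the post's D3 result: deeper drafts keep accepting tokens, but the verify cost grows faster than the acceptance payoff.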