🚀 Big update for the LocalLlama community: Multi-Token Prediction (MTP) is coming to mlx-lm for the Qwen3.5 series.
(not my PR, just sharing because this is cool 👇)
Early support for generating multiple tokens per forward pass is in, and the gains already look solid:
• 15.3 → 23.3 tok/s (~1.5x throughput boost)
• ~80.6% acceptance rate
The author of the PR benchmarked with Qwen3.5-27B 4-bit on an M4 Pro.
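For intuition on how an ~80% acceptance rate relates to the ~1.5x speedup, here's a toy back-of-the-envelope model (my simplification, not the PR's actual implementation): if each forward pass drafts extra tokens and each draft is accepted independently with some probability, the expected tokens emitted per pass gives an upper bound on the speedup.

```python
# Toy model of MTP/speculative throughput. Illustrative only: assumes
# independent per-token acceptance, which is a simplification.

def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per forward pass when `draft_len` extra
    tokens are drafted, each accepted with `acceptance_rate`, and the
    first rejection stops the run."""
    return 1.0 + sum(acceptance_rate ** i for i in range(1, draft_len + 1))

# With the ~80.6% acceptance rate reported in the PR and one extra draft token:
print(f"{expected_tokens_per_pass(0.806, 1):.2f} tokens/pass")  # → 1.81 tokens/pass
```

That ~1.8x is a ceiling; the measured ~1.5x (15.3 → 23.3 tok/s) is plausible once verification overhead is accounted for.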
Huge kudos to AirRunner for contributing this 🙌
PR: https://github.com/ml-explore/mlx-lm/pull/990
