Mistral Medium 3.5 128B, MLX 4-bit, ~70 GB
Reddit r/LocalLLaMA / 5/1/2026

I converted Mistral Medium 3.5 128B to MLX 4-bit. The Eagle model for speculative decoding is not yet supported by MLX. The vision encoder is included (full BF16, unquantized). Thinking mode works (reasoning_effort="high" gives you the [THINK]...[/THINK] chain), tool calling works, and the 256K context is intact. There was a bug in mlx-vlm's mistral3 sanitize function: it wasn't stripping the "model." prefix from vision tower and projector keys, which caused 438 parameters to be skipped. I patched it locally before converting; details are in the HF README. I am getting ~5 tok/s on a 96 GB M2 Max. For sampling I recommend temp 0.7 / top_p 0.95 / top_k 20 in reasoning mode, or temp 0.0–0.7 / top_p 0.8 for quick replies. Mistral recommends leaving the repeat penalty disabled, but I am getting too many loops; I am not sure what the best value should be.
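The sanitize bug the author describes amounts to a key-renaming step during weight loading. A minimal sketch of the kind of fix involved is below; the function shape and the exact key prefixes ("model.vision_tower.", "model.multi_modal_projector.") are illustrative assumptions based on the post's description, not the actual mlx-vlm patch.

```python
def sanitize(weights: dict) -> dict:
    """Strip a leading 'model.' prefix from vision tower and projector keys
    so they match the parameter names the MLX model expects.
    Illustrative sketch only; prefixes are assumed, not from mlx-vlm."""
    sanitized = {}
    for key, value in weights.items():
        if key.startswith("model.vision_tower.") or key.startswith(
            "model.multi_modal_projector."
        ):
            # Without this rename, these keys match nothing in the target
            # module and get silently skipped at load time.
            key = key[len("model."):]
        sanitized[key] = value
    return sanitized
```

If keys like these are left with the prefix, the loader finds no matching parameter and drops them, which is consistent with the 438 skipped parameters the post reports.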
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- A Reddit post reports a conversion of Mistral Medium 3.5 128B into MLX 4-bit, with an estimated footprint of ~70 GB, and notes that the model currently appears “broken” and is not recommended for general use.
- The conversion details include that Eagle speculative decoding isn’t supported in MLX yet, while features such as a vision encoder (included unquantized BF16), reasoning (“Thinking” mode), tool calling, and a 256K context window are said to work.
- The author patched a bug in mlx-vlm’s mistral3 sanitize function related to vision tower and projector keys, which otherwise would skip 438 parameters during conversion.
- Performance is reported at roughly ~5 tokens/second on a 96 GB Mac M2 Max, and the post shares suggested sampling hyperparameters; Mistral recommends disabling the repeat penalty, but the author reports frequent looping and is unsure what value to use instead.
- The author suggests downloading only if users want to help troubleshoot, and points to the Hugging Face README for conversion details and fixes.
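The suggested settings (temp 0.7 / top_p 0.95 / top_k 20, or temp 0.0 for greedy quick replies) combine three standard sampling filters. The pure-Python sketch below shows how they interact; it is a conceptual illustration, not mlx-lm's actual sampler implementation.

```python
import math
import random


def sample(logits, temperature=0.7, top_k=20, top_p=0.95, rng=random):
    """Temperature + top-k + top-p (nucleus) sampling over raw logits.
    Conceptual sketch of the post's suggested settings, not mlx-lm code."""
    if temperature <= 0:
        # temp 0.0 means greedy decoding (the "quick replies" end of the range).
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature scaling: lower temperature sharpens the distribution.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda pair: pair[1], reverse=True)
    probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize the surviving tokens and draw one.
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for tok, p in kept:
        r -= p
        if r <= 0:
            return tok
    return kept[-1][0]
```

With top_k=1 the candidate set collapses to the single most likely token, so the call is equivalent to greedy decoding regardless of temperature.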