Mistral medium 3.5 128B, MLX 4bit, ~70 GB

Reddit r/LocalLLaMA / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit post reports a conversion of Mistral Medium 3.5 128B into MLX 4-bit, with an estimated footprint of ~70 GB, and notes that the model currently appears “broken” and is not recommended for general use.
  • The conversion details include that Eagle speculative decoding isn’t supported in MLX yet, while features such as a vision encoder (included unquantized in BF16), reasoning (“Thinking” mode), tool calling, and a 256K context window are said to work.
  • The author patched a bug in mlx-vlm’s mistral3 sanitize function related to vision tower and projector keys, which otherwise would skip 438 parameters during conversion.
  • Performance is reported at roughly ~5 tokens/second on a 96 GB Mac M2 Max; the post also shares suggested sampling/reasoning hyperparameters and notes that Mistral recommends disabling the repeat penalty, though the author sees looping and thinks the value may need tuning.
  • The author suggests downloading only if users want to help troubleshoot, and points to the Hugging Face README for conversion details and fixes.
Mistral medium 3.5 128B, MLX 4bit, ~70 GB

This model seems utterly broken for now. I do not recommend downloading or using it, unless you are planning to help troubleshoot it. This is not a problem with the conversion, but with the model itself.

I converted Mistral Medium 3.5 128B to MLX 4-bit. The Eagle model for speculative decoding is not yet supported by MLX.
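For reference, a conversion like this is normally driven by mlx-vlm's converter; the call below is only a hedged sketch, and the import path, argument names, and Hugging Face repo id are assumptions modeled on the mlx-lm interface rather than details from the post.

```python
# Hypothetical conversion call; assumes mlx-vlm exposes a convert() similar to
# mlx-lm's. The repo id and output path are placeholders, not the real ones.
from mlx_vlm import convert

convert(
    hf_path="mistralai/Mistral-Medium-3.5-128B",   # assumed source repo
    mlx_path="mistral-medium-3.5-128b-mlx-4bit",   # local output directory
    quantize=True,
    q_bits=4,   # 4-bit weights -> roughly ~70 GB on disk
)
```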

Vision encoder included (full BF16, unquantized). Thinking mode works (reasoning_effort="high" gives you the [THINK]...[/THINK] chain), tool calling works, 256K context.
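The post doesn't show how reasoning_effort is plumbed in; one plausible route is passing it as a chat-template variable when building the prompt. This is a minimal sketch assuming mlx-vlm's load/generate helpers and a template that forwards reasoning_effort; none of it is confirmed by the post.

```python
# Minimal sketch, assuming mlx-vlm's load()/generate() helpers and a chat
# template that accepts a reasoning_effort variable (both are assumptions).
from mlx_vlm import load, generate

model, processor = load("mistral-medium-3.5-128b-mlx-4bit")  # assumed local path

messages = [{"role": "user", "content": "Why is the sky blue?"}]
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="high",  # assumed kwarg; should yield a [THINK]...[/THINK] chain
)
print(generate(model, processor, prompt, max_tokens=512))
```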

There was a bug in mlx-vlm's mistral3 sanitize function: it wasn't stripping the "model." prefix from vision tower and projector keys, which caused 438 parameters to be skipped. I patched it locally before converting. Details in the HF readme.
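The actual patch isn't included in the post (it's in the HF readme); the snippet below is only an illustration of the kind of fix described, with the key prefixes and the sanitize signature assumed rather than taken from mlx-vlm's code.

```python
# Illustrative sketch of the described fix; key prefixes and the function
# signature are assumptions, not mlx-vlm's actual mistral3 implementation.
def sanitize(weights: dict) -> dict:
    sanitized = {}
    for key, value in weights.items():
        # Strip the leading "model." so vision tower / projector weights map
        # onto the expected module names instead of being skipped at conversion.
        if key.startswith(("model.vision_tower", "model.multi_modal_projector")):
            key = key[len("model."):]
        sanitized[key] = value
    return sanitized
```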

I am getting ~5 tok/s on a 96 GB M2 Max. For sampling I recommend temp 0.7 / top_p 0.95 / top_k 20 in reasoning mode, or temp 0.0–0.7 / top_p 0.8 for quick replies. Mistral recommends leaving the repeat penalty disabled, but I am getting too many loops; I am not sure what the best value is.
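To keep the suggested settings in one place, here they are as a plain dictionary; the keys are descriptive labels to map onto whatever generation API you use, not a specific library's parameter names.

```python
# Sampling presets suggested in the post; keys are descriptive labels only.
SAMPLING_PRESETS = {
    "reasoning":     {"temp": 0.7, "top_p": 0.95, "top_k": 20},
    "quick_replies": {"temp": (0.0, 0.7), "top_p": 0.8},  # temp given as a range
}
# Mistral recommends leaving the repeat penalty disabled; the author sees
# looping, so a small penalty may need experimenting (no known-good value yet).
REPEAT_PENALTY = None  # None = disabled (Mistral's recommendation)
```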

submitted by /u/ex-arman68