Command A+ (218B MoE) running on Apple Silicon — MLX port, PR open

Reddit r/LocalLLaMA / 5/24/2026

Key Points

Cohere’s Command A+ (218B total / 25B active, 128 experts, Apache 2.0) has been ported to Apple Silicon via an mlx-lm implementation with a PR opened for review.
The write-up documents key architectural details: sigmoid-based (not softmax) routing with normalized top-8 selection, plus a single shared expert with a larger intermediate size whose output is averaged with the routed output.
The port covers the model's attention and positioning choices: a 3:1 sliding-window pattern (three sliding-window layers for every full-attention layer), RoPE applied only to the sliding layers, and parallel attention+MLP blocks that share the same LayerNorm.
A quantization-related pitfall is highlighted: the biases in the W4A4 checkpoint are NVFP4 quantization artifacts, while the BF16 model is bias-free; sanitize() handles both formats.
Local validation was limited by memory (the W4A4 checkpoint needs ~132GB and the author's M3 Max has 128GB), but a tester on a larger machine reported successful BF16→Q8 conversion, generation, tool calling, and multi-turn use at 22.9 tok/s generation and 57.6 tok/s prompt processing, with a 241GB peak.

Cohere dropped Command A+ on the 20th (218B total / 25B active, 128 experts top-8, Apache 2.0). Wrote a cohere2_moe implementation for mlx-lm to get it running on Apple Silicon.

Architecture notes for anyone digging into this model:

- Single shared expert with a larger intermediate size (16384 = 4096×4), combined with the routed output via (routed + shared)/2

- Sigmoid routing (not softmax), normalized top-8

- Sliding window 3:1 (3 sliding + 1 full), interleaved RoPE on sliding layers only

- Parallel attn+MLP block off the same LayerNorm

- Gotcha that cost me a few iterations: the biases in the W4A4 checkpoint are NVFP4 quantization artifacts — the BF16 model is entirely bias-free. sanitize() handles both formats.
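
For anyone who wants the notes above in code form, here is a minimal pure-Python sketch of the routing, the shared-expert combination, the 3:1 layer schedule, and the parallel block. It is an illustration, not code from the PR. All function names are made up. It also assumes three things the notes don't spell out: normalization happens after the top-8 selection, the full-attention layer is the 4th in each group of four, and the parallel block adds both branches to a residual.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_top_k(router_logits, k=8):
    """Score each expert with a sigmoid, keep the k largest scores,
    then renormalize the kept scores so they sum to 1.
    (Normalizing after selection is an assumption.)"""
    scores = [sigmoid(z) for z in router_logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return top, [scores[i] / total for i in top]


def moe_output(x, router_logits, experts, shared_expert, k=8):
    """Weighted sum of the selected experts, averaged with the shared
    expert via (routed + shared) / 2. Scalars stand in for tensors."""
    idx, weights = route_top_k(router_logits, k)
    routed = sum(w * experts[i](x) for i, w in zip(idx, weights))
    return (routed + shared_expert(x)) / 2.0


def is_sliding_layer(layer_idx, period=4):
    """3 sliding + 1 full. Assumes the full-attention layer is the last
    one in each group of four (indices 3, 7, ...)."""
    return (layer_idx + 1) % period != 0


def parallel_block(x, norm, attn, mlp):
    """Attention and MLP both read the same normalized input.
    Adding both to the residual is an assumption."""
    h = norm(x)
    return x + attn(h) + mlp(h)


if __name__ == "__main__":
    logits = [(i - 64) * 0.05 for i in range(128)]  # toy logits, 128 experts
    idx, weights = route_top_k(logits)
    print(sorted(idx), round(sum(weights), 6))
    print([is_sliding_layer(i) for i in range(8)])
```

In this schedule, RoPE would be applied only in layers where is_sliding_layer() returns True, per the note above.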

I couldn't validate locally (W4A4 needs ~132GB; my M3 Max has 128GB). https://github.com/vlbosch ran it on a bigger box: BF16→Q8 conversion, then clean generation, tool calling, and multi-turn with KV-cache continuation, at 22.9 tok/s generation / 57.6 tok/s prompt processing, 241GB peak.
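
A rough back-of-envelope check on those memory figures, weights only. The bits-per-weight values are assumptions, not numbers from the checkpoint or the PR: 8-bit values with a 16-bit scale and 16-bit bias per group of 64 for the Q8 path, and 4-bit values with one 8-bit scale per block of 16 for the NVFP4-style path.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 218e9  # total parameter count quoted in the post

# Assumed layout: 8-bit values + 16-bit scale + 16-bit bias per 64 weights.
Q8_BITS_PER_WEIGHT = 8 + 32 / 64
# Assumed layout: 4-bit values + one 8-bit scale per 16 weights.
FP4_BITS_PER_WEIGHT = 4 + 8 / 16

if __name__ == "__main__":
    print(f"8-bit weights: ~{weight_gb(TOTAL_PARAMS, Q8_BITS_PER_WEIGHT):.0f} GB")
    print(f"4-bit weights: ~{weight_gb(TOTAL_PARAMS, FP4_BITS_PER_WEIGHT):.0f} GB")
```

Both estimates land somewhat below the reported numbers. That would be consistent with some tensors staying at higher precision and with the KV cache and activations adding on top, but the measured figures above are the ones to trust.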

PR is open on ml-explore/mlx-lm (in review). Happy to take feedback or fixes — and if someone with 192GB+ wants to test the W4A4 path directly, would love the error output.
https://github.com/ml-explore/mlx-lm/pull/1294

https://preview.redd.it/wvwa6irg6y2h1.png?width=3006&format=png&auto=webp&s=52c0a56ff7bc6ea0dec7fd4e43e79d7525047c1c

submitted by /u/Remarkable_Jicama775