Marco-Mini (17.3B, 0.86B active) and Marco-Nano (8B, 0.6B active) by Alibaba

Reddit r/LocalLLaMA / 4/10/2026

📰 News · Signals & Early Trends · Models & Research

Key Points

  • Alibaba International Digital Commerce has released two new instruction-tuned sparse Mixture-of-Experts (MoE) multilingual LLMs on Hugging Face: Marco-Mini-Instruct (17.3B parameters, ~0.86B active per token) and Marco-Nano-Instruct (8B parameters, ~0.6B active per token).
  • Marco-Mini-Instruct activates about 5% of its parameters per token (0.86B active) and is reported to achieve top average benchmark performance across English, multilingual general, and multilingual cultural tests versus comparable instruct models.
  • Marco-Nano-Instruct activates about 7.5% of its parameters per token (0.6B active) yet is reported to achieve the best average benchmark performance among comparable instruct models with up to ~3.84B activated parameters.
  • The models emphasize efficiency via extreme sparsity, with Marco-Mini-Instruct described as having 256 experts and using 8 active experts per token, and both variants described as using a post-training pipeline including SFT and online policy distillation (a rough sketch of the distillation step follows this list).
  • Both releases are offered under the Apache 2.0 license and reportedly support 29 languages.
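
For readers unfamiliar with the term, here is a minimal sketch of what an online policy distillation step typically looks like: the student samples its own continuations and is nudged toward the teacher's per-token distributions via a KL loss. This is a generic illustration assuming Hugging Face-style causal LMs; the actual Marco-MoE recipe (KL direction, teacher cascade, masking) isn't specified in the post, and `distillation_loss` and `prompt_len` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, sequences, prompt_len):
    """Token-level KL toward the teacher on continuations the student sampled itself
    (that is the "online" part). Generic sketch, not the actual Marco-MoE recipe."""
    # Logits predicting tokens prompt_len .. end of sequence.
    student_logits = student(sequences).logits[:, prompt_len - 1 : -1]
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits[:, prompt_len - 1 : -1]

    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # KL(teacher || student); the real recipe's KL direction/weighting is not public.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```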

Looks like these were released six days ago. Did a search and didn't see a post about them.

https://huggingface.co/AIDC-AI/Marco-Mini-Instruct

https://huggingface.co/AIDC-AI/Marco-Nano-Instruct

Pretty wild parameter/active ratio, should be lightning fast.
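
For anyone who wants to try them, here's a minimal transformers loading sketch using the model IDs from the links above. The dtype/device settings, the prompt, and whether `trust_remote_code` is actually required are assumptions, so check the model cards.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-Mini-Instruct"  # or "AIDC-AI/Marco-Nano-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",       # assumption: use whatever dtype the checkpoint ships in
    device_map="auto",        # assumption: let accelerate place the weights
    trust_remote_code=True,   # assumption: may or may not be needed for the MoE code
)

messages = [{"role": "user", "content": "Can you introduce yourself in Turkish?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```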

Marco-Mini-Instruct is the instruction-tuned variant of Marco-Mini-Base, a highly sparse Mixture-of-Experts (MoE) multilingual language model from the Marco-MoE family, developed by Alibaba International Digital Commerce. It activates only 0.86B out of 17.3B total parameters (5% activation ratio) per token. Marco-Mini-Instruct achieves the best average performance across English, multilingual general, and multilingual cultural benchmarks when compared against instruct models with up to 12B activated parameters, including Qwen3-4B-Instruct, Ministral3-8B-Instruct, Gemma3-12B-Instruct, LFM2-24B-A2B, and Granite4-Small-Instruct.


Marco-Nano-Instruct is the post-trained variant of Marco-Nano-Base, a highly sparse Mixture-of-Experts (MoE) multilingual language model from the Marco-MoE family, developed by Alibaba International Digital Commerce. It activates only 0.6B out of 8B total parameters (7.5% activation ratio) per token. Despite its extreme sparsity, Marco-Nano-Instruct achieves the best average performance across English, multilingual general, and multilingual cultural benchmarks among all comparable instruct models up to 3.84B activated parameters.
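
The activation ratios quoted above follow directly from the headline numbers; a trivial sanity check, nothing model-specific:

```python
# Sanity check on the headline activation ratios (active params / total params).
specs = {
    "Marco-Mini-Instruct": (0.86e9, 17.3e9),
    "Marco-Nano-Instruct": (0.60e9, 8.0e9),
}
for name, (active, total) in specs.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
# Marco-Mini-Instruct: 5.0% of parameters active per token
# Marco-Nano-Instruct: 7.5% of parameters active per token
```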

https://xcancel.com/ModelScope2022/status/2042084482661191942

https://pbs.twimg.com/media/HFbvyB-WsAAayv1.jpg?name=orig

Meet Marco-Mini-Instruct: a highly sparse MoE multilingual model from Alibaba International. 17.3B total params, only 0.86B active (5% activation ratio). 🚀

Beats Qwen3-4B, Gemma3-12B, Granite4-Small on English, multilingual general, and cultural benchmarks — with a fraction of their active params.

🌍 29 languages: Arabic, Turkish, Kazakh, Bengali, Nepali and more

🧠 256 experts, 8 active per token. Drop-Upcycling from Qwen3-0.6B-Base.

🎯 2-stage post-training: SFT + Online Policy Distillation (Qwen3-30B → Qwen3-Next-80B cascade)

✅ Apache 2.0
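
For context on the "256 experts, 8 active per token" line, here is a minimal sketch of generic top-k expert routing, which is what keeps only ~5% of the parameters running per token. It assumes a plain softmax router and standard FFN experts; the real Marco-MoE layer (shared experts, load-balancing losses, hidden sizes) may differ, and `TopKMoE`, `d_model`, and `d_ff` are placeholder names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k MoE feed-forward layer (256 experts, 8 active per token).
    Illustration only -- not the actual Marco-MoE implementation."""

    def __init__(self, d_model=1024, d_ff=2048, n_experts=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the 8 chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # only top_k of n_experts run per token
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out
```

Per token, only the 8 selected expert FFNs are evaluated (plus the shared attention/embedding weights), which is roughly where the 0.86B-of-17.3B active-parameter figure comes from.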

submitted by /u/AnticitizenPrime