Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

arXiv cs.CL / 4/29/2026


Key Points

  • Marco-MoE is a fully open multilingual sparse Mixture-of-Experts (MoE) model suite designed to activate only about 5% of total parameters per input token.
  • The approach combines extreme sparsity with “upcycling” from dense models to enable efficient pre-training on 5T tokens, while reportedly delivering a leading performance-to-compute ratio (a minimal sketch of the sparse routing idea follows this list).
  • On English and multilingual benchmarks, Marco-MoE outperforms similarly sized competitors, and its post-trained Marco-MoE-Instruct variants beat competing models that activate 3–14× more parameters.
  • The paper analyzes how the model learns structured expert activation patterns that are shared across related languages while remaining highly specialized for linguistically isolated ones, and it shows that the design supports scalable language expansion with less interference than is typical of dense models.
  • To benefit the research community, the authors disclose full training datasets, recipes, and model weights.
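
The roughly 5% per-token activation comes from top-k expert routing. The sketch below is a minimal illustration of that idea, not the paper's implementation; the module names, dimensions, and the 64-expert / top-4 configuration are assumptions chosen only to show how per-token sparsity arises.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Top-k routed mixture-of-experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model=1024, d_ff=2048, num_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        logits = self.router(x)                         # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep only k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# With 64 experts and top_k=4, each token passes through only 4 expert blocks,
# i.e. roughly 6% of the expert parameters -- the same order of per-token
# sparsity reported for Marco-MoE.
layer = SparseMoELayer()
tokens = torch.randn(8, 1024)
print(layer(tokens).shape)  # torch.Size([8, 1024])
```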

Abstract

We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-Instruct variants, which surpass the performance of competing models possessing 3–14× more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.
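
For the upcycling step, a common recipe in the MoE literature is to initialize every expert from a pre-trained dense model's feed-forward weights, so the upcycled model starts from the dense model's behavior before the router lets experts specialize. The sketch below illustrates that general idea under assumed module shapes; the paper's exact procedure may differ.

```python
import copy
import torch.nn as nn


def upcycle_dense_to_experts(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    """Clone a pre-trained dense feed-forward block into num_experts experts.

    Each expert starts as an exact copy, so the upcycled MoE layer initially
    reproduces the dense computation; continued pre-training with a router
    then lets the experts diverge and specialize.
    """
    return nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))


# Illustrative use: one dense FFN block (assumed shapes) cloned into 64 experts.
dense_ffn = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
experts = upcycle_dense_to_experts(dense_ffn, num_experts=64)
print(len(experts))  # 64
```

Combined with a top-k router like the one sketched after the key points, this yields a layer whose initial outputs match the dense model while activating only a small fraction of parameters per token.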