feat: Add Mimo v2.5 model support by AesSedai · Pull Request #22493 · ggml-org/llama.cpp

Reddit r/LocalLLaMA / 5/7/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • AesSedai has opened pull request #22493 against ggml-org/llama.cpp, adding support for the XiaomiMiMo MiMo v2.5 model.
  • MiMo v2.5 is a sparse Mixture-of-Experts (MoE) architecture with 310B total parameters and 15B activated parameters.
  • The model supports context lengths of up to 1 million tokens and is multimodal, handling text, image, video, and audio.
  • The architecture includes a 729M-parameter ViT vision encoder, a 261M-parameter audio transformer encoder, and a Multi-Token Prediction (MTP) component with 329M parameters.
  • This update broadens llama.cpp’s capability for local multimodal inference by enabling deployment of the MiMo v2.5 family (a usage sketch follows below).
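
Once the PR lands and a GGUF conversion of the model is available, it could be loaded like any other llama.cpp model. The sketch below uses the llama-cpp-python bindings; the GGUF filename, context size, and GPU offload settings are illustrative assumptions, not values taken from the PR.

```python
# Minimal sketch: loading a locally converted MiMo v2.5 GGUF via llama-cpp-python.
# File name, context size, and offload settings are assumptions for illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="MiMo-V2.5-Q4_K_M.gguf",  # hypothetical quantized conversion
    n_ctx=32768,        # practical working context; the model itself supports up to 1M tokens
    n_gpu_layers=-1,    # offload all layers to GPU if memory allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the MiMo v2.5 architecture in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```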

https://huggingface.co/XiaomiMiMo/MiMo-V2.5

Model Summary

  • Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters
  • Context Length: Up to 1M tokens
  • Modalities: Text, Image, Video, Audio
  • Vision Encoder: 729M-param ViT (28 layers: 24 SWA + 4 Full)
  • Audio Encoder: 261M-param Audio Transformer (24 layers: 12 SWA + 12 Full)
  • Multi-Token Prediction (MTP): 329M parameters, 3 layers
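
To put the 310B-total / 15B-activated split in perspective, here is a rough back-of-the-envelope estimate of on-disk weight size at common llama.cpp quantization levels. The bits-per-weight figures are approximate averages and ignore GGUF metadata and the vision/audio encoder overhead.

```python
# Rough weight-size estimate for a 310B-parameter model at common quantization levels.
# Bits-per-weight values are approximate averages; actual GGUF sizes will differ.
TOTAL_PARAMS = 310e9

approx_bits_per_weight = {
    "F16":    16.0,
    "Q8_0":    8.5,
    "Q4_K_M":  4.8,
}

for name, bpw in approx_bits_per_weight.items():
    gib = TOTAL_PARAMS * bpw / 8 / 2**30
    print(f"{name:7s} ~{gib:,.0f} GiB")
```

Even at roughly 4.8 bits per weight this works out to on the order of 170+ GiB of weights, so local deployment will target multi-GPU or large unified-memory machines despite only 15B parameters being active per token.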
submitted by /u/jacek2023