XiaomiMiMo/MiMo-V2.5-ASR · Hugging Face

Reddit r/LocalLLaMA / 4/24/2026


Key Points

  • MiMo-V2.5-ASR is a state-of-the-art end-to-end automatic speech recognition (ASR) model from the Xiaomi MiMo team, targeting high-accuracy transcription across Chinese (including multiple dialects) and English.
  • It handles dialect mixing and Chinese–English code-switching, and is designed to transcribe naturally with no language tags required.
  • It shows strong performance in noisy conditions (e.g. far-field capture), overlapping multi-speaker conversations, knowledge-intensive content (proper names, place names, technical terminology, classical poetry), and even lyrics recognition.
  • As a new training approach, the team reports systematic improvements across multiple evaluation dimensions through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm.
  • It reports state-of-the-art (SOTA) results on a wide range of public benchmarks, including strong performance on challenging English benchmarks on the Open ASR Leaderboard.

MiMo-V2.5-ASR is a state-of-the-art end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations. MiMo-V2.5-ASR achieves state-of-the-art results on a wide range of public benchmarks.

Abstract

Automatic speech recognition systems are expected to faithfully transcribe speech signals that originate from diverse languages, dialects, accents, and domains, and that are captured under a wide variety of acoustic conditions. While conventional end-to-end models perform well on in-domain data, they still fall short of real-world requirements in challenging scenarios such as dialect mixing, code-switching, knowledge-intensive content, noisy environments, and multi-speaker conversations. We present MiMo-V2.5-ASR, a large-scale end-to-end speech recognition model developed by the Xiaomi MiMo team. Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves systematic improvements along the following dimensions:

  • 🗣️ Chinese Dialects: Native support for Wu, Cantonese, Hokkien, Sichuanese, and more.
  • 🔀 Code-Switch: Seamless Chinese–English code-switching transcription with no language tags required.
  • 🎵 Song Recognition: High-precision lyrics transcription for Chinese and English songs, even with mixed accompaniment and vocals.
  • 🔊 Noisy Environments: Robust recognition under heavy noise, far-field capture, and other adverse acoustic conditions.
  • 👥 Multi-Speaker: Accurate transcription of overlapping, multi-party conversations such as meetings.
  • 🇬🇧 Complex English Scenarios: Leading performance on the Open ASR Leaderboard for challenging English benchmarks such as AMI.
  • 📚 Knowledge-Intensive Recognition: Precise recognition of classical poetry, technical terminology, personal names, place names, and other knowledge-dense material.
  • 📝 Native Punctuation: Punctuation generated natively from prosody and semantics, delivering ready-to-use transcripts with no post-processing needed.
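The benchmark claims above are conventionally reported as word error rate (WER), the metric the Open ASR Leaderboard ranks models by: the word-level edit distance between reference and hypothesis transcripts, divided by the reference length. As a minimal, self-contained sketch of how that number is computed (in practice a library such as jiwer with proper text normalization would be used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference
    word count, computed via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Single-row dynamic-programming table: d[j] holds the edit distance
    # between the first i reference words and the first j hypothesis words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag = d[0]
        d[0] = i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev_diag, d[j] = d[j], min(
                d[j] + 1,          # deletion
                d[j - 1] + 1,      # insertion
                prev_diag + cost,  # substitution (or match when cost == 0)
            )
    return d[len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quick fox"))        # 0.25 (one deletion)
```

Lower is better; a leaderboard score is typically the WER averaged over several test sets, which is why robustness on hard sets like AMI moves the ranking.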
submitted by /u/jacek2023