VoxCPM2 is out - 2B params, 30 languages. Major upgrade over VoxCPM1.5.

Reddit r/LocalLLaMA / 4/10/2026

Key Points

  • OpenBMB has released VoxCPM2, positioning it as a major upgrade over VoxCPM1.5 with much larger scale and expanded capabilities.
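  • VoxCPM2 scales to 2B parameters (up from 0.5B) and expands language support from Chinese and English to 30 languages. The post cites 48kHz audio and an RTF of 0.30 for VoxCPM2; the 1.8M training hours, 44.1kHz output, and RTF of 0.17 (RTX 4090) listed in the post's comparison table appear to be the VoxCPM1.5 baseline.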
  • The release adds new voice generation features including Voice Design from text alone, Controllable Cloning with emotion/pace/expression steering, and Ultimate Cloning for higher-fidelity results using reference audio plus transcripts.
  • VoxCPM2 is distributed via Hugging Face and is described as running in about 8GB of VRAM with streaming support.
  • The community is already comparing it against other TTS systems (e.g., Qwen3-TTS, Open-MOSS, OmniVoice), focusing on multilingual coverage, latency/RTF, audio fidelity, and how well text-only voice design works.

OpenBMB just released VoxCPM2, the follow-up to their VoxCPM-0.5B and a significant step up from VoxCPM1.5 in both scale and capabilities.

VoxCPM1.5 → VoxCPM2:

                  VoxCPM1.5            VoxCPM2
Params            0.5B                 2B
Audio quality     44.1kHz              48kHz
Languages         Chinese + English    30
Training data     1.8M hours           not listed
RTF (RTX 4090)    0.17                 0.30
Voice Design      No                   Yes
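
For anyone wanting to reproduce the RTF row: real-time factor is conventionally wall-clock synthesis time divided by the duration of the audio produced, so lower is faster and anything under 1.0 is faster than real time. A minimal sketch of how one might measure it is below. The `fake_synthesize` function is a hypothetical stand-in, not the VoxCPM2 API; swap in whatever call your TTS stack exposes, assuming it returns samples plus a sample rate.

```python
import time
from typing import Callable, Sequence, Tuple


def measure_rtf(
    synthesize: Callable[[str], Tuple[Sequence[float], int]], text: str
) -> float:
    """Return synthesis wall-clock time divided by generated audio duration."""
    start = time.perf_counter()
    samples, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> Tuple[Sequence[float], int]:
    # Hypothetical stand-in for a real TTS call (NOT the VoxCPM2 API):
    # pretends to work for 50 ms, then returns one second of silence at 48 kHz.
    time.sleep(0.05)
    return [0.0] * 48_000, 48_000


rtf = measure_rtf(fake_synthesize, "hello")
print(f"RTF = {rtf:.3f}")
```

When comparing numbers across models, it helps to time the same text on the same GPU and to exclude model loading, since published RTF figures do not always state those conditions.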

New in VoxCPM2:

  • Voice Design — generate a novel voice from a text description alone, no reference audio needed
  • Controllable Cloning — clone + steer emotion, pace, expression
  • Ultimate Cloning — max fidelity with reference audio + transcript
  • ~8GB VRAM, streaming support
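
As a rough sanity check on the ~8GB figure, here is a back-of-the-envelope estimate of weight memory for a 2B-parameter model. The precisions are assumptions, since the post does not say which one the release uses, and the estimate covers weights only: activations, caches, and any audio codec or vocoder come on top.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 2e9  # "2B params" from the post title

# Assumed precisions; the post does not state which one VoxCPM2 ships in.
for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{label:10s} {weight_memory_gib(NUM_PARAMS, nbytes):5.2f} GiB")
```

At 16-bit precision the weights alone come to under 4 GiB, which would leave headroom within 8GB for everything else; at 32-bit they would not fit comfortably.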

HuggingFace: https://huggingface.co/openbmb/VoxCPM2
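
If you want the files locally before trying anything, a minimal sketch using the `huggingface_hub` package is below. The repo id comes from the link above; the `local_dir` default and the helper name `download_voxcpm2` are illustrative choices, and how you load and run the model afterwards depends on the project's own instructions, which this sketch does not cover.

```python
REPO_ID = "openbmb/VoxCPM2"  # taken from the Hugging Face link in the post


def download_voxcpm2(local_dir: str = "./VoxCPM2") -> str:
    """Download every file in the model repo and return the local path."""
    # Imported lazily so this module still loads if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (performs a network download, so it is not run automatically):
#   path = download_voxcpm2()
#   print(path)
```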

Anyone tested VoxCPM2 yet?

  • vs Qwen3-TTS — naturalness and multilingual coverage?
  • vs Open-MOSS — latency and voice quality?
  • vs OmniVoice (k2-fsa) - it covers 646 languages vs VoxCPM2's 30 and has an RTF of 0.025 vs 0.30, but outputs 24kHz vs 48kHz. Is the quality tradeoff worth it for the speed and language coverage?
  • Does Voice Design (no reference audio) actually hold up?
  • Non-English results?
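
To put the OmniVoice numbers in concrete terms, here is a small sketch that turns the quoted RTF and sample-rate figures into synthesis time for an hour of audio and the highest representable frequency (half the sample rate, per Nyquist). The `TtsSpec` class is just an illustrative container, and the figures are the ones quoted in this post, which may not have been measured on the same hardware.

```python
from dataclasses import dataclass


@dataclass
class TtsSpec:
    name: str
    rtf: float            # synthesis time / audio duration, as quoted in the post
    sample_rate_hz: int
    languages: int

    def synthesis_seconds(self, audio_seconds: float) -> float:
        return self.rtf * audio_seconds

    def nyquist_hz(self) -> float:
        # Highest frequency a given sample rate can represent
        return self.sample_rate_hz / 2


specs = [
    TtsSpec("OmniVoice", rtf=0.025, sample_rate_hz=24_000, languages=646),
    TtsSpec("VoxCPM2", rtf=0.30, sample_rate_hz=48_000, languages=30),
]

ONE_HOUR = 3600.0
for s in specs:
    print(
        f"{s.name:10s} {s.synthesis_seconds(ONE_HOUR) / 60:6.1f} min per hour of audio, "
        f"bandwidth up to {s.nyquist_hz() / 1000:.0f} kHz, {s.languages} languages"
    )

speedup = specs[1].rtf / specs[0].rtf
print(f"OmniVoice is about {speedup:.0f}x faster by quoted RTF")
```

By these figures an hour of audio takes roughly 1.5 minutes on OmniVoice versus 18 minutes on VoxCPM2, while VoxCPM2's 48kHz output can carry content up to 24 kHz versus 12 kHz. Whether the extra bandwidth is audible for speech is exactly what side-by-side samples would settle.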

Audio comparisons would be great if anyone has them.

submitted by /u/Downtown_Radish_8040