VoxCPM2 is out - 2B params, 30 languages. Major upgrade over VoxCPM1.5.

Reddit r/LocalLLaMA / 4/10/2026

Key Points

OpenBMB has released VoxCPM2, positioning it as a major upgrade over VoxCPM1.5 with much larger scale and expanded capabilities.
VoxCPM2 increases parameters to 2B, supports 30 languages, and is reported to train on 1.8M hours of audio, with 48kHz output and a reported RTF of 0.30 on an RTX 4090.
The release adds new voice generation features including Voice Design from text alone, Controllable Cloning with emotion/pace/expression steering, and Ultimate Cloning for higher-fidelity results using reference audio plus transcripts.
VoxCPM2 is distributed via Hugging Face and is described as operating with ~8GB VRAM and streaming support, lowering deployment requirements.
The community is already comparing it against other TTS systems (e.g., Qwen3-TTS, Open-MOSS, OmniVoice), focusing on multilingual coverage, latency/RTF, audio fidelity, and how well text-only voice design works.

OpenBMB just released VoxCPM2, the follow-up to VoxCPM-0.5B and VoxCPM1.5. It is a big jump in scale and capabilities.

VoxCPM1.5 → VoxCPM2:

New in VoxCPM2:

Voice Design — generate a novel voice from a text description alone, no reference audio needed
Controllable Cloning — clone + steer emotion, pace, expression
Ultimate Cloning — max fidelity with reference audio + transcript
~8GB VRAM, streaming support
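
For a rough sense of why ~8GB is plausible for a 2B-parameter model, here is a back-of-the-envelope sketch of weight memory alone. The 2 bytes per parameter assumes bf16/fp16 weights, which is my assumption and not something the release details quoted here confirm. Activations, caches, and the audio decoder all add to this figure.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for model weights only, in GiB.

    bytes_per_param=2 assumes bf16/fp16 storage (an assumption, not a
    confirmed detail of the VoxCPM2 release); use 4 for fp32.
    """
    return num_params * bytes_per_param / 1024**3


# 2B parameters at 2 bytes each is roughly 3.73 GiB of weights,
# which leaves headroom inside an ~8GB budget for everything else.
print(f"{weight_memory_gib(2e9):.2f} GiB")
```

At fp32 the same weights would need about twice that, so the ~8GB figure would be much tighter without half-precision weights.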

Anyone tested VoxCPM2 yet?

vs Qwen3-TTS — naturalness and multilingual coverage?
vs Open-MOSS — latency and voice quality?
vs OmniVoice (k2-fsa), which covers 646 languages vs VoxCPM2's 30 with an RTF of 0.025 vs 0.30, but outputs 24kHz vs 48kHz: is the quality tradeoff worth it for the speed and language coverage?
Does Voice Design (no reference audio) actually hold up?
Non-English results?

Audio comparisons would be great if anyone has them.
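
If anyone wants to post comparable speed numbers alongside audio, RTF is just synthesis time divided by the duration of the audio produced, so lower is faster. A minimal sketch follows. `synthesize` is a hypothetical stand-in for whichever model call you are timing, not an actual VoxCPM2 or OmniVoice API, and it is assumed to return a 1-D sequence of mono samples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synth_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical model wrapper
    text: str,
    sample_rate: int,
) -> float:
    """Time one synthesis call and return its RTF."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    # Audio duration in seconds = number of samples / samples per second.
    return real_time_factor(elapsed, len(samples) / sample_rate)


# With the figures quoted in the post, 10 s of audio would take about
# 0.30 * 10 = 3.0 s at RTF 0.30 and 0.025 * 10 = 0.25 s at RTF 0.025.
```

On a GPU, the call being timed should block until the audio is actually ready (for PyTorch models, calling `torch.cuda.synchronize()` before stopping the timer), and the first warm-up run is best discarded. Otherwise the numbers will not be comparable across models.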