vibevoice.cpp: Microsoft VibeVoice (TTS + long-form ASR with diarization) ported to ggml/C++, runs on CPU/CUDA/Metal/Vulkan, no Python at inference

Reddit r/LocalLLaMA / 5/5/2026


Key Points

  • The article announces vibevoice.cpp, a pure-C++ (ggml) port of Microsoft VibeVoice that enables speech-to-speech voice cloning TTS and long-form ASR with speaker diarization.
  • It supports multiple inference backends—CPU, CUDA, Metal, and Vulkan—and can run as a single binary or as an embeddable libvibevoice.so with a flat C ABI.
  • The system includes pre-converted GGUF model assets for TTS (with voice prompt conversion) and a 7B-parameter long-form ASR that outputs JSON segments containing start/end times, speaker labels, and text.
  • Reported performance includes a 68s sample synthesized in 28s (RTF 0.41) on CUDA with Q4_K quantization, and up to 17 minutes of audio processed in one shot on CPU, with memory usage scaling heavily on longer inputs.
  • Compared with Microsoft’s original Python/Transformers/vLLM approach, the port removes Python and torch from inference while matching the same core model components and running a closed-loop TTS→ASR recall test in CI.

A few weeks ago I shipped vibevoice.cpp, a pure-C++ ggml port of Microsoft
VibeVoice (the speech-to-speech model with voice cloning, https://github.com/microsoft/VibeVoice). Wanted to post a follow-up here because we're at a point where the engine has grown well past "first-pass port" and into something other people might actually want to run.

This work was brought to you with <3 from the LocalAI team!

What it does:

  • TTS with pre-converted voice prompts (any of upstream's .pt voices, ours or yours, converted via scripts/convert_voice_to_gguf.py): give it a 30s reference clip and it generates 24kHz speech in the cloned voice. Pre-converted GGUFs (including the 0.5B realtime model) ship on https://huggingface.co/mudler/vibevoice.cpp-models
  • Long-form ASR with speaker diarization: the 7B-parameter model returns JSON segments {start, end, speaker, content}. Tested up to 17 minutes of audio in one shot (a consumer sketch follows this list).
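
To make the segment format concrete, here's a minimal sketch of a consumer for that JSON output. Only the four field names {start, end, speaker, content} come from the post; everything else (the speaker label format, the use of nlohmann/json) is an assumption for illustration.

```cpp
// Hypothetical consumer of vibevoice.cpp's ASR segment JSON. Only the
// field names {start, end, speaker, content} come from the post; the
// speaker label format here is a guess.
// Build: g++ -std=c++17 segments.cpp (needs the nlohmann/json header).
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>
#include <vector>

struct Segment {
    double start;        // segment start time in seconds
    double end;          // segment end time in seconds
    std::string speaker; // diarization label
    std::string content; // transcribed text
};

int main() {
    // Shape inferred from the post, with made-up sample values.
    auto doc = nlohmann::json::parse(R"([
        {"start": 0.0, "end": 4.2, "speaker": "0", "content": "Hello there."},
        {"start": 4.2, "end": 9.8, "speaker": "1", "content": "Hi, welcome back."}
    ])");

    std::vector<Segment> segs;
    for (const auto &j : doc) {
        segs.push_back({j["start"].get<double>(),
                        j["end"].get<double>(),
                        j["speaker"].get<std::string>(),
                        j["content"].get<std::string>()});
    }
    for (const auto &s : segs) {
        std::cout << "[" << s.start << "-" << s.end << "] speaker "
                  << s.speaker << ": " << s.content << "\n";
    }
}
```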

Backends: CPU (baseline), CUDA, Metal, Vulkan, and hipBLAS via ggml's backend dispatch. Ships as a single binary or as libvibevoice.so with a flat C ABI for embedding (purego/cgo/dlopen-friendly).
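
A minimal sketch of what dlopen-based embedding looks like; the symbol name below (vv_version) is a hypothetical placeholder, since the post doesn't spell out the C ABI. The real entry points are in the repo's public header.

```cpp
// dlopen sketch for embedding libvibevoice.so. "vv_version" is a
// HYPOTHETICAL symbol; check the repo header for the actual flat C ABI.
// Build: g++ -std=c++17 embed.cpp (add -ldl on older glibc).
#include <dlfcn.h>
#include <cstdio>

int main() {
    void *lib = dlopen("libvibevoice.so", RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    // A flat C ABI means plain, unmangled function symbols, which is
    // exactly what purego/cgo/dlopen consumers need.
    using version_fn = const char *(*)(void);
    auto vv_version = reinterpret_cast<version_fn>(dlsym(lib, "vv_version"));
    if (vv_version) {
        std::printf("vibevoice: %s\n", vv_version());
    }

    dlclose(lib);
    return 0;
}
```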

Numbers (RTF = wall-clock inference time / audio duration):

| Workload | Inference | RTF | Peak RSS |
|---|---|---|---|
| 68s sample, CUDA Q4_K (GB10) | 28 s | 0.41 | ~6 GB |
| 68s sample, CPU Q4_K (R9) | 150 s | 2.20 | ~8 GB |
| 17min audio, CPU Q8_0 | 1929 s | 1.94 | ~26 GB |

Compared to upstream Microsoft Python + Transformers + vLLM plugin:

  • Same Qwen2.5 7B/0.5B backbone, same DPM-Solver diffusion head, same windowed prefill (5 text tokens / 6 speech frames per the mlx-audio pattern).
  • Closed-loop TTS→ASR test asserts 100% source-word recall on a fixed seed; runs in CI (a sketch of the recall check follows this list).
  • No Python at inference, no vLLM, no torch.
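
For readers wondering what "source-word recall" means here: the fraction of words in the synthesized source text that reappear in the round-trip transcript. A minimal sketch of such a check, assuming simple lowercase alphanumeric tokenization (the repo's actual CI test may normalize differently):

```cpp
// Source-word recall for a TTS->ASR round trip: synthesize known text,
// transcribe it back, assert every source word reappears. Tokenization
// (lowercase, alnum-only) is an assumption, not the repo's exact test.
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

static std::vector<std::string> words(const std::string &text) {
    std::vector<std::string> out;
    std::string w;
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            w += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        } else if (!w.empty()) {
            out.push_back(w);
            w.clear();
        }
    }
    if (!w.empty()) out.push_back(w);
    return out;
}

// Fraction of source words that appear anywhere in the transcript.
static double source_word_recall(const std::string &source,
                                 const std::string &transcript) {
    auto src = words(source);
    std::unordered_set<std::string> hyp;
    for (auto &w : words(transcript)) hyp.insert(w);
    size_t hit = 0;
    for (auto &w : src) hit += hyp.count(w);
    return src.empty() ? 1.0 : double(hit) / double(src.size());
}

int main() {
    // In the real test, the transcript comes from running ASR on TTS output.
    std::string source     = "The quick brown fox jumps over the lazy dog.";
    std::string transcript = "the quick brown fox jumps over the lazy dog";
    assert(source_word_recall(source, transcript) == 1.0);
    return 0;
}
```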

Limitations / honest caveats:

  • 17min audio peak is still 26 GB on CPU because of the encoder activation pool + 14 GB Q8_0 weights. Q4_K cuts the model side (~10 GB on disk), but the encoder pool needs its own work.
  • The diffusion head builds 20 small graphs per latent frame; graph reuse there is the next obvious win (see the ggml sketch after this list).
  • No streaming output yet; it emits a complete WAV / full transcript.
  • ASR transcript quality is whatever upstream gives you; on a 17-minute Italian recording, the recovered transcript stays faithful through natural sentence boundaries.
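
On the graph-reuse point: the usual ggml pattern is to build a compute graph once and re-run it per step, swapping only the input data, instead of rebuilding a fresh graph each time. A minimal CPU-only illustration of that pattern, not the actual diffusion-head code:

```cpp
// Illustration of ggml graph reuse (NOT vibevoice.cpp's diffusion head):
// build one graph, then re-run it for all 20 solver steps while only
// refreshing the input tensor's data.
#include "ggml.h"
#include "ggml-cpu.h"  // CPU compute API in recent ggml versions
#include <cstring>
#include <vector>

int main() {
    ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    ggml_context *ctx = ggml_init(params);

    const int n = 256;
    ggml_tensor *x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    ggml_tensor *w = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    // Stand-in op; a real diffusion step is a full denoiser pass.
    ggml_tensor *y = ggml_mul(ctx, x, w);

    // Built once, reused for every step below.
    ggml_cgraph *gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, y);

    std::vector<float> input(n, 1.0f), weights(n, 0.5f);
    std::memcpy(w->data, weights.data(), ggml_nbytes(w));

    for (int step = 0; step < 20; ++step) {  // 20 solver steps per frame
        std::memcpy(x->data, input.data(), ggml_nbytes(x));
        ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);
        // y->data holds this step's output; feed it back as the next input.
        std::memcpy(input.data(), y->data, ggml_nbytes(y));
    }

    ggml_free(ctx);
    return 0;
}
```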

Repo: https://github.com/mudler/vibevoice.cpp (MIT)

Models: https://huggingface.co/mudler/vibevoice.cpp-models

LocalAI integration: vibevoice.cpp is already wired up as a backend and ready to go in LocalAI!

Happy to answer questions and hear any feedback!

submitted by /u/mudler_it