An Explanation of VibeVoice, Which Synthesizes 90-Minute, 4-Speaker Conversations at 80× Compression Relative to Encodec

Zenn / 4/8/2026



Key Points

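VibeVoice is a long-form, multi-speaker speech generation system that uses 80× compression relative to Encodec to synthesize conversations of up to 90 minutes with up to 4 speakers.
The goal is to make training and generation more efficient by heavily compressing the speech representation while preserving conversational naturalness.
The article focuses on the design of the generation pipeline: how to reduce the amount of information handled through compression while still keeping multiple speakers' utterances distinct during synthesis.
On the practical side, the discussion centers on the feasibility of generating long-form conversations and of multi-speaker synthesis from a limited amount of recorded audio.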

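Introduction

A full explanation of this paper (in English, with diagrams) is available on flecto.

Paper overview (TL;DR)

VibeVoice is a groundbreaking TTS model that achieves 80× compression relative to Encodec with a 7.5 Hz tokenizer and, using next-token diffusion, can synthesize natural dialogue of up to 4 speakers and 90 minutes within a single LLM context window. It reaches a MOS of 3.76 for audio quality, outperforming competing models including Gemini-2.5-Pro-Preview-TTS (3.40) and Eleven-V3 Alpha (3.66).

Background and problem setting

Recent TTS systems have made remarkable progress on short, single-speaker utterances, but long-form ...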
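The figures quoted in the overview (a 7.5 Hz tokenizer, 80× compression relative to Encodec, and 90 minutes of audio) can be turned into a rough token budget, which helps show why a single LLM context window becomes plausible. The following is a minimal back-of-the-envelope sketch, not the authors' method. The baseline rate is only what the two stated numbers imply when multiplied together, and `HYPOTHETICAL_CONTEXT_TOKENS` is an assumed value chosen purely for illustration; the excerpt does not state the model's actual context length, and text tokens for the script are ignored here.

```python
# Rough token budget for long-form speech at a low tokenizer frame rate.
# The three constants below come from the article's summary.
FRAME_RATE_HZ = 7.5            # speech frames (tokens) per second
COMPRESSION_VS_ENCODEC = 80    # stated compression ratio relative to Encodec
DURATION_MIN = 90              # maximum dialogue length in minutes

# ASSUMPTION: an illustrative context size, NOT taken from the article.
HYPOTHETICAL_CONTEXT_TOKENS = 65_536


def speech_frames(duration_min: float, rate_hz: float) -> int:
    """Number of frames needed to cover duration_min minutes at rate_hz."""
    return round(duration_min * 60 * rate_hz)


def implied_baseline_rate(rate_hz: float, ratio: float) -> float:
    """Tokens per second of the baseline implied by the stated ratio."""
    return rate_hz * ratio


frames = speech_frames(DURATION_MIN, FRAME_RATE_HZ)
baseline_rate = implied_baseline_rate(FRAME_RATE_HZ, COMPRESSION_VS_ENCODEC)
baseline_tokens = speech_frames(DURATION_MIN, baseline_rate)
fits = frames <= HYPOTHETICAL_CONTEXT_TOKENS

print(f"frames at 7.5 Hz for 90 min: {frames}")            # 40500
print(f"implied baseline rate (tokens/s): {baseline_rate}")  # 600.0
print(f"baseline tokens for 90 min: {baseline_tokens}")     # 3240000
print(f"fits in assumed context: {fits}")                   # True
```

Under these assumptions, 90 minutes of speech costs about 40,500 frames, whereas the implied baseline would need over 3.2 million tokens for the same audio. Changing `HYPOTHETICAL_CONTEXT_TOKENS` shows how much room would remain for the text script in a given context size.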




