Voxtral TTS

arXiv cs.AI / 3/27/2026

Key Points

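Voxtral TTS is presented as an expressive text-to-speech (TTS) model that generates natural multilingual speech from as little as 3 seconds of reference audio.
It uses a hybrid architecture: semantic speech tokens are generated autoregressively, while acoustic tokens are generated with flow matching.
Speech tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme.
In evaluations by native speakers, Voxtral TTS was preferred over ElevenLabs Flash v2.5 for naturalness and expressivity, with a reported 68.4% win rate in multilingual voice cloning.
The model weights are released under a CC BY-NC license.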

Abstract

We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.
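
The abstract mentions an FSQ (finite scalar quantization) component but gives no details of how Voxtral Codec applies it. As a rough illustration of the general FSQ idea only, the sketch below bounds each latent coordinate with tanh, rounds it to a small set of integer levels, and packs the result into a single token id. The function names, the level counts, and the restriction to odd level counts are assumptions made for this example; none of it is taken from Voxtral Codec.

```python
import math


def fsq_quantize(z, levels):
    """Round each coordinate of z to one of levels[i] integer values.

    Hypothetical helper for illustration; handles odd level counts only.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half  # squash into (-half, half)
        codes.append(int(round(bounded)))  # snap to the nearest integer level
    return codes


def fsq_index(codes, levels):
    """Pack per-dimension codes into one integer token id (mixed radix)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)  # shift code into [0, n_levels)
    return index


if __name__ == "__main__":
    levels = [5, 5, 5]  # assumed: 3 latent dimensions, 5 levels each
    codes = fsq_quantize([0.0, 10.0, -10.0], levels)
    print(codes, fsq_index(codes, levels))
```

With these assumed settings the implicit codebook has 5 × 5 × 5 = 125 entries, and no codebook vectors need to be learned, which is the usual motivation for FSQ over plain VQ.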
