Mistral Introduces "Voxtral TTS": An Open-Weight Text-to-Voice Model Capable Of Cloning Any Voice From 3 Seconds Of Audio, Runs In 9 Languages, & Beats Elevenlabs Flash V2.5 With A 68.4% Human Preference Win Rate.

Reddit r/LocalLLaMA / 4/7/2026


Key Points

  • Mistral introduced Voxtral TTS, an open-weight text-to-voice model that the company says can clone a person’s voice zero-shot from just 3 seconds of audio, with no fine-tuning.
  • The model is reported to support 9 languages and perform cross-lingual voice cloning, such as using a French voice prompt to generate English speech.
  • Mistral reports strong benchmark results, including a 68.4% human preference win rate in zero-shot multilingual voice cloning against ElevenLabs Flash v2.5 and parity on emotional expressiveness and quality with ElevenLabs v3.
  • Voxtral TTS is described as low-latency (about 70ms model latency, with the same time-to-first-audio as Flash v2.5) and efficient enough to run on 3GB of RAM, targeting smartphone, laptop, and edge deployment.
  • By releasing the weights on Hugging Face, Mistral positions Voxtral TTS as a challenge to proprietary, API-locked approaches in voice cloning and TTS markets.

ElevenLabs built a moat on proprietary weights and API lock-in. Mistral just put the weights on Hugging Face.

The model captures not just the voice but the person: accents, inflections, intonations, and vocal fillers, the "ums" and "ahs" that make a voice sound human instead of synthetic. From 3 seconds of reference audio. Zero fine-tuning. Zero-shot.


Key Highlights:

  • → 68.4% win rate against ElevenLabs Flash v2.5 in zero-shot multilingual voice cloning

  • → Beats ElevenLabs Flash v2.5 on every one of the 9 supported languages

  • → Matches ElevenLabs v3 on emotional expressiveness and quality

  • → 70ms model latency: same time-to-first-audio as Flash v2.5, at higher quality

  • → 4B parameters. Runs on 3GB RAM. Smartphone. Laptop. Edge devices.

  • → 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic

  • → Cross-lingual voice cloning: a French voice prompt generating English speech works out of the box
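
For a sense of what "4B parameters" in "3GB RAM" implies, here is a rough back-of-the-envelope sketch. It is not from the announcement. It assumes 1 GB = 10**9 bytes, that every parameter is resident in memory at one uniform precision, and it ignores activations and runtime overhead. The function names are made up for illustration.

```python
# Back-of-the-envelope weight memory for a 4B-parameter model.
# Assumptions (not from the announcement): 1 GB = 10**9 bytes, all
# parameters are held in memory at a single uniform precision, and
# activations / runtime overhead are ignored.

PARAMS = 4_000_000_000   # "4B parameters" from the post
RAM_BUDGET_GB = 3.0      # "Runs on 3GB RAM" from the post


def weight_gb(params: int, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def max_bits_per_param(params: int, budget_gb: float) -> float:
    """Largest average precision whose weights still fit in the budget."""
    return budget_gb * 1e9 * 8 / params


if __name__ == "__main__":
    for bits in (32, 16, 8, 4):
        gb = weight_gb(PARAMS, bits)
        verdict = "fits" if gb <= RAM_BUDGET_GB else "does not fit"
        print(f"{bits:>2}-bit weights: {gb:5.1f} GB ({verdict} in {RAM_BUDGET_GB:g} GB)")
    limit = max_bits_per_param(PARAMS, RAM_BUDGET_GB)
    print(f"Break-even: {limit:.1f} bits per parameter")
```

Under these assumptions, 16-bit weights alone would take 8 GB and only roughly 4-bit weights (2 GB) fit, so the 3GB figure presumably depends on quantization or on not keeping every weight in memory at once. The post does not say which.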


Link to the Official Announcement: https://mistral.ai/news/voxtral-tts

Link to the Paper: https://arxiv.org/pdf/2603.25551

Link to the Model Weights: https://huggingface.co/mistralai/Voxtral-4B-TTS-2603
submitted by /u/44th--Hokage