Mistral releases a new open-source model for speech generation

TechCrunch / 3/26/2026

📰 NewsSignals & Early TrendsIndustry & Market MovesModels & Research

Key Points

  • Mistral has released Voxtral TTS, a new open-source text-to-speech model aimed at voice AI assistants and enterprise applications like customer support and sales engagement.
  • The model supports nine languages—English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic—positioning it for multilingual deployments.
  • Mistral says Voxtral TTS is designed to run on smaller devices (e.g., smartwatch, smartphone, laptop, and other edge hardware) with significantly lower costs than comparable offerings.
  • By targeting competitive pricing and broad deployment options, Mistral is positioning itself directly against TTS players such as ElevenLabs, Deepgram, and OpenAI.

French AI company Mistral released a new open-source text-to-speech model on Thursday that can be used by voice AI assistants or in enterprise use cases like customer support. The model, which lets enterprises build voice agents for sales and customer engagement, puts Mistral in direct competition with the likes of ElevenLabs, Deepgram, and OpenAI.

The new model, called Voxtral TTS, supports nine languages, including English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

“Our customers have been asking for a speech model. So we built a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices. The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance,” Pierre Stock, vp of science operations at Mistral AI, told TechCrunch during a phone interview.

Image Credits: Mistral

Mistral said the new model can adapt a custom voice with a sample of less than five seconds, and also capture characteristics like subtle accents, inflections, intonations, and irregularities in the flow of speech. The model, based on Ministral 3B, can switch between languages easily without losing the characteristics of the voice, which is useful for use cases like dubbing or real-time translation. Stock said the company wanted the model to sound human and not robotic.

The model has been built for real-time performance, according to the company. It has a time-to-first-audio (TTFA) — a measure of when the model starts ‘speaking’ after receiving input — of 90ms for a 10-second sample of 500 characters. The model also has a real-time factor (RTF) of 6x, which means it can render a 10-second clip in roughly 1.6 seconds.

Image Credits: Mistral AI

Earlier this year, Mistral launched a pair of transcription models, one for large batch processing and the other for real-time use cases with low latency. With the new speech model, the company is likely aiming to provide a full suite of voice products to enterprises.

“We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well. The main benefit of that is you get way more information with an end-to-end agentic system that supports audio as an input or output,” Stock said.

Techcrunch event

Disrupt 2026: The tech ecosystem, all in one room

Your next round. Your next hire. Your next breakout opportunity. Find it at TechCrunch Disrupt 2026, where 10,000+ founders, investors, and tech leaders gather for three days of 250+ tactical sessions, powerful introductions, and market-defining innovation. Register now to save up to $400.

Save up to $300 or 30% to TechCrunch Founder Summit

1,000+ founders and investors come together at TechCrunch Founder Summit 2026 for a full day focused on growth, execution, and real-world scaling. Learn from founders and investors who have shaped the industry. Connect with peers navigating similar growth stages. Walk away with tactics you can apply immediately

Offer ends March 13.

San Francisco, CA | October 13-15, 2026

Mistral’s positioning is that its open source and customization bit will help enterprises adopt its voice models over competitors, as they can tune it the way they want.