Originally published on NextFuture
What's new this week
On April 15, 2026, Google shipped Gemini 3.1 Flash TTS as a public preview on AI Studio and Vertex AI. The model ID is gemini-3.1-flash-tts-preview, and it introduces 200+ inline audio tags (for example [whispers], [happy], [pause]), 30 prebuilt voices, native multi-speaker dialogue, and coverage across 70+ languages. The free tier is open for prototyping; paid usage is $1 per million text input tokens and $20 per million audio output tokens, roughly an order of magnitude cheaper than ElevenLabs for a comparable amount of generated audio. Output is 24 kHz mono PCM, returned inline as base64, so there is no webhook dance and no separate voice-studio account to manage.
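That feature list maps to a fairly small request body. Here is a minimal sketch of building one, assuming the preview keeps the responseModalities / speechConfig shape of Google's existing generateContent TTS API; the field names and the "Kore" voice name are assumptions, not confirmed details of this preview:

```typescript
// Sketch of a Gemini Flash TTS request body. Field names (responseModalities,
// speechConfig, prebuiltVoiceConfig) are modeled on Google's existing
// generateContent TTS schema and are assumptions for this preview.

interface TtsRequest {
  contents: { parts: { text: string }[] }[];
  generationConfig: {
    responseModalities: string[];
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: string } };
    };
  };
}

// Inline tags like [whispers] or [pause] ride along inside the text itself,
// so there is no separate SSML layer to build.
export function buildTtsRequest(text: string, voiceName: string): TtsRequest {
  return {
    contents: [{ parts: [{ text }] }],
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName } },
      },
    },
  };
}

const req = buildTtsRequest("[whispers] The launch is tomorrow.", "Kore");
console.log(req.contents[0].parts[0].text); // the tagged text passes through verbatim
```

POST this body to the model's generateContent endpoint; per the pricing above, the tagged text bills as input tokens and the returned PCM as audio output tokens.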
Why it matters for builders
Web engineer. Previously, adding a "listen to this article" button to a Next.js blog meant wiring up ElevenLabs or patching tts-1-hd with custom SSML to fix prosody. With Flash TTS, a single generateContent call and one inline [slow] or [excited] tag produces the same emotional pacing: no SSML build step, no separate voice studio, and the audio comes back as base64 PCM that you can wrap in a WAV header (or feed to the Web Audio API) and play through an <audio> element. The full call lives in a server action, so API keys stay on the server.
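One practical wrinkle: an <audio> element cannot play raw PCM directly, so the decoded bytes need a minimal WAV header first. A sketch of that wrapper for the 24 kHz mono 16-bit PCM the model returns; this is standard WAV container math, nothing Gemini-specific:

```typescript
// Wrap raw 16-bit mono PCM in a minimal 44-byte WAV header so the result
// can be served as audio/wav or turned into a Blob URL for an <audio> tag.

export function pcmToWav(pcm: Uint8Array, sampleRate = 24000): Uint8Array {
  const numChannels = 1;
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * (bitsPerSample / 8);
  const blockAlign = numChannels * (bitsPerSample / 8);

  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeStr = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeStr(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // total file size minus 8
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format 1 = PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeStr(36, "data");
  view.setUint32(40, pcm.length, true);     // raw PCM payload size

  const wav = new Uint8Array(44 + pcm.length);
  wav.set(new Uint8Array(header), 0);
  wav.set(pcm, 44);
  return wav;
}
```

On the client, `URL.createObjectURL(new Blob([wav], { type: "audio/wav" }))` gives you a src for the <audio> element; on the server, the same bytes can be returned from a route handler with a Content-Type of audio/wav.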
AI engineer. Building a voice agent that reads CRM notes aloud used to need two inference passes (LLM then TTS) plus manual speaker diarization. Flash TTS accepts multi-speaker transcripts inline: your LLM emits Joe: ... Jane: ... turns, and a single TTS call renders each speaker with its own prebuilt voice, so the whole dialogue comes back as one audio stream.
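That hand-off can be sketched as a small helper that turns LLM-emitted speaker turns into one multi-speaker request. The multiSpeakerVoiceConfig shape below mirrors Google's existing TTS schema and is an assumption for this preview, as are the "Puck" and "Kore" voice names:

```typescript
// Turn a list of LLM-emitted speaker turns into a single multi-speaker TTS
// request. The multiSpeakerVoiceConfig field names mirror Google's existing
// generateContent TTS schema and are assumptions for this preview.

interface Turn {
  speaker: string;
  text: string;
}

export function buildDialogueRequest(
  turns: Turn[],
  voices: Record<string, string> // speaker name -> prebuilt voice name
) {
  // "Joe: ..." style transcript, one line per turn, passed through as-is.
  const transcript = turns.map((t) => `${t.speaker}: ${t.text}`).join("\n");
  return {
    contents: [{ parts: [{ text: transcript }] }],
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        multiSpeakerVoiceConfig: {
          speakerVoiceConfigs: Object.entries(voices).map(
            ([speaker, voiceName]) => ({
              speaker,
              voiceConfig: { prebuiltVoiceConfig: { voiceName } },
            })
          ),
        },
      },
    },
  };
}

const req = buildDialogueRequest(
  [
    { speaker: "Joe", text: "The Q3 numbers look strong." },
    { speaker: "Jane", text: "Agreed, let's flag the renewal." },
  ],
  { Joe: "Puck", Jane: "Kore" }
);
```

The point is that diarization never leaves the text domain: the LLM labels the turns, and one TTS call voices all of them.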