Originally published on NextFuture
What's new this week
On April 15, 2026, Google shipped Gemini 3.1 Flash TTS as a public preview on AI Studio and Vertex AI. The model ID is gemini-3.1-flash-tts-preview, and it introduces 200+ inline audio tags (for example [whispers], [happy], [pause]), 30 prebuilt voices, native multi-speaker dialogue, and coverage across 70+ languages. The free tier is open for prototyping; paid usage is $1 per million text input tokens and $20 per million audio output tokens, roughly an order of magnitude cheaper than ElevenLabs for the same length of audio. Output is 24 kHz mono PCM, returned inline as base64, so there is no webhook dance and no separate voice-studio account to manage.
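To make the request shape concrete, here is a minimal sketch of a single generateContent call over REST. It assumes the preview accepts the same payload that Google documents for its earlier Gemini TTS previews (responseModalities, speechConfig, prebuiltVoiceConfig) and that "Kore" is still among the 30 prebuilt voices; neither is confirmed by the announcement, so check the current API reference before relying on it. The helper names are illustrative.

```typescript
// Sketch only. Model ID is from the announcement; the endpoint and payload
// shape are ASSUMED to match the documented Gemini generateContent REST format.
const MODEL_ID = "gemini-3.1-flash-tts-preview";
const ENDPOINT =
  `https://generativelanguage.googleapis.com/v1beta/models/${MODEL_ID}:generateContent`;

interface TtsRequest {
  contents: { parts: { text: string }[] }[];
  generationConfig: {
    responseModalities: string[];
    speechConfig: { voiceConfig: { prebuiltVoiceConfig: { voiceName: string } } };
  };
}

// Build the JSON body. Inline audio tags such as [whispers] travel inside the text.
function buildTtsRequest(text: string, voiceName: string): TtsRequest {
  return {
    contents: [{ parts: [{ text }] }],
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: { voiceConfig: { prebuiltVoiceConfig: { voiceName } } },
    },
  };
}

// Pull the base64 audio out of a response-shaped object, if present.
function extractAudioBase64(response: unknown): string | undefined {
  const r = response as {
    candidates?: { content?: { parts?: { inlineData?: { data?: string } }[] } }[];
  };
  return r.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
}

// Send the request from server-side code so the API key never reaches the browser.
// "Kore" is an assumed voice name, not one listed in the announcement.
async function synthesize(text: string, apiKey: string, voiceName = "Kore"): Promise<string> {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-goog-api-key": apiKey },
    body: JSON.stringify(buildTtsRequest(text, voiceName)),
  });
  if (!res.ok) throw new Error(`TTS request failed with status ${res.status}`);
  const audio = extractAudioBase64(await res.json());
  if (!audio) throw new Error("Response contained no inline audio");
  return audio; // base64-encoded PCM
}
```

Calling synthesize("[whispers] The build is green.", process.env.GEMINI_API_KEY ?? "") would return the base64 PCM string described above, provided the assumptions hold.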
Why it matters for builders
Web engineer. Previously, adding a "listen to this article" button to a Next.js blog meant wiring ElevenLabs or patching tts-1-hd with custom SSML to fix prosody. With Flash TTS, a single generateContent call and one inline [slow] or [excited] tag produces the same emotional pacing, with no SSML build step and no separate voice studio. The audio comes back as base64 PCM; raw PCM has no container, so wrap it in a WAV header before handing it to an <audio> element. The full call lives in a server action, so API keys stay on the server.
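Browsers will not play headerless PCM, so the decoded bytes need a container first. The sketch below prepends a standard 44-byte RIFF/WAVE header. The 24 kHz mono defaults come from the announcement; 16-bit little-endian samples are an assumption, and pcmToWav is an illustrative name.

```typescript
// Wrap raw PCM bytes in a minimal WAV (RIFF) container.
// ASSUMPTION: samples are 16-bit little-endian; the source only states 24 kHz mono.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 24000,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const wav = new Uint8Array(44 + pcm.length);
  const view = new DataView(wav.buffer);

  // Write a four-character ASCII chunk tag at the given byte offset.
  const writeTag = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) view.setUint8(offset + i, tag.charCodeAt(i));
  };

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk length for plain PCM
  view.setUint16(20, 1, true);              // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // bytes per second
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeTag(36, "data");
  view.setUint32(40, pcm.length, true);     // payload size in bytes
  wav.set(pcm, 44);
  return wav;
}
```

In a Node server action, Buffer.from(base64Audio, "base64") yields a Uint8Array that can be passed straight in; the result can then be served with a Content-Type of audio/wav.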
AI engineer. Building a voice agent that reads CRM notes aloud used to need two inference passes (LLM then TTS) plus manual speaker diarization. Flash TTS accepts multi-speaker transcripts inline: your LLM emits Joe: ... Jane: ... and TTS returns a single audio clip with two distinct voices. Your agent graph loses a node, latency drops by a full network round-trip, and you skip maintaining a separate speaker-labelling prompt that drifts with every model upgrade.
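As a rough illustration of the multi-speaker flow, the sketch below flattens dialogue turns into the Joe: ... Jane: ... transcript form and builds a matching per-speaker voice config. The multiSpeakerVoiceConfig shape mirrors what Google documents for earlier Gemini TTS previews and is assumed, not confirmed, for this model; the Turn type, helper names, and voice names are hypothetical.

```typescript
// One line of dialogue produced by the upstream LLM.
interface Turn {
  speaker: string;
  text: string; // may carry inline audio tags such as [pause]
}

// Flatten turns into a "Speaker: text" transcript, one turn per line.
function toTranscript(turns: Turn[]): string {
  return turns.map((t) => `${t.speaker}: ${t.text}`).join("\n");
}

// Map each speaker label to a prebuilt voice.
// ASSUMPTION: this config shape carries over from earlier Gemini TTS previews.
function buildMultiSpeakerConfig(voices: Record<string, string>) {
  return {
    multiSpeakerVoiceConfig: {
      speakerVoiceConfigs: Object.keys(voices).map((speaker) => ({
        speaker,
        voiceConfig: { prebuiltVoiceConfig: { voiceName: voices[speaker] } },
      })),
    },
  };
}
```

The transcript string would go in the request text and the config under speechConfig; the speaker labels in both must match exactly for the voices to be assigned as intended.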
Indie maker. A Duolingo-style pronunciation app priced at $4/month barely broke even on Azure TTS at roughly $0.05 per lesson. At $20 per million audio output tokens, a 30-second lesson now costs about $0.003, so gross margin holds above 90% on the same $4 tier. Voice is no longer the line item that kills your side project's unit economics, which means TTS-heavy features like audiobook summaries, podcast previews, or accessibility narration finally pencil out on a free-tier SaaS.
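The arithmetic behind those numbers is easy to rerun for your own lesson lengths. Note that $0.003 for 30 seconds at $20 per million tokens works out to 150 output tokens, or about 5 tokens per second of audio. That rate is inferred from the article's figures, not taken from a published spec, so it is a parameter in this sketch and should be checked against real billing data.

```typescript
const USD_PER_MILLION_AUDIO_TOKENS = 20; // paid-tier output price from the article

// Cost of one clip, given its length and an ASSUMED tokens-per-second rate.
function audioCostUsd(seconds: number, tokensPerSecond: number): number {
  const tokens = seconds * tokensPerSecond;
  return (tokens * USD_PER_MILLION_AUDIO_TOKENS) / 1_000_000;
}

// Gross margin per subscriber, counting only TTS as cost of goods sold.
function grossMargin(
  monthlyPriceUsd: number,
  lessonsPerMonth: number,
  costPerLessonUsd: number,
): number {
  return (monthlyPriceUsd - lessonsPerMonth * costPerLessonUsd) / monthlyPriceUsd;
}

// 30-second lesson at the inferred 5 tokens per second, 100 lessons on a $4 plan.
const costPerLesson = audioCostUsd(30, 5);
const margin = grossMargin(4, 100, costPerLesson);
console.log(costPerLesson.toFixed(4), (margin * 100).toFixed(1) + "%");
```

At these inputs the margin stays above 90% until a subscriber passes roughly 130 lessons a month; a higher real token rate would lower that ceiling proportionally.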
Hands-on: try it in under 15 minutes
Grab a free API key from aistudio.google.com, store it as GEMINI_API_KEY, then install the SDK: