Using Gemini 3.1 Flash TTS with Next.js: Ship a Voice UX in 15 Minutes (2026)

Dev.to / 2026/4/22


Key points

  • Google launched "Gemini 3.1 Flash TTS" as a public preview on AI Studio and Vertex AI on April 15, 2026; the model ID is gemini-3.1-flash-tts-preview.
  • It supports 200+ inline audio tags ([whispers], [happy], [pause], and so on), 30 prebuilt voices, multi-speaker dialogue, and 70+ languages; audio is returned inline as base64-encoded 24 kHz mono PCM.
  • Pricing is $1 per million text input tokens and $20 per million audio output tokens, reportedly about an order of magnitude cheaper than ElevenLabs under comparable run conditions.
  • In Next.js, a single generateContent call with tags like [slow] or [excited] reproduces expressive delivery, with no SSML authoring and no separate voice-management studio to integrate.
  • For voice-agent development, multi-speaker transcripts are accepted inline, which cuts the traditional two-pass LLM→TTS pipeline and manual speaker diarization, and lowers latency.

Originally published on NextFuture

What's new this week

On April 15, 2026, Google shipped Gemini 3.1 Flash TTS as a public preview on AI Studio and Vertex AI. The model ID is gemini-3.1-flash-tts-preview, and it introduces 200+ inline audio tags (for example [whispers], [happy], [pause]), 30 prebuilt voices, native multi-speaker dialogue, and coverage across 70+ languages. The free tier is open for prototyping; paid usage is $1 per million text input tokens and $20 per million audio output tokens — roughly an order of magnitude cheaper than ElevenLabs under comparable run conditions. Output is 24 kHz mono PCM, returned inline as base64, so there is no webhook dance and no separate voice-studio account to manage.
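One practical detail: raw PCM isn't directly playable by most clients, so the decoded bytes usually get a container first. A small helper can prepend the standard 44-byte RIFF/WAV header — this is generic WAV plumbing, not part of any Gemini SDK:

```typescript
// Wrap raw 16-bit mono PCM (as returned base64-encoded by the TTS call)
// in a standard 44-byte RIFF/WAV header so browsers can play it as audio/wav.
export function pcmToWav(pcm: Uint8Array, sampleRate = 24_000): Uint8Array {
  const channels = 1;
  const bitsPerSample = 16;
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const buf = new ArrayBuffer(44 + pcm.length);
  const view = new DataView(buf);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // remaining chunk size
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt sub-chunk size
  view.setUint16(20, 1, true);              // audio format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeStr(36, "data");
  view.setUint32(40, pcm.length, true);
  new Uint8Array(buf, 44).set(pcm);
  return new Uint8Array(buf);
}
```

Do this once on the server and the client only ever sees a plain `audio/wav` blob.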

Why it matters for builders

Web engineer. Previously, adding a "listen to this article" button to a Next.js blog meant wiring ElevenLabs or patching tts-1-hd with custom SSML to fix prosody. With Flash TTS, a single generateContent call and one inline [slow] or [excited] tag produces the same emotional pacing — no SSML build step, no separate voice studio, and the audio comes back as base64 PCM you can wrap in a WAV header and hand straight to an <audio> element. The full call lives in a server action, so API keys stay on the server.
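The client half of that flow is just a base64 decode and a Blob URL. A sketch, assuming the server action has already wrapped the PCM in a WAV header and returns it base64-encoded (speakAction and audioRef below are hypothetical names):

```typescript
// Decode the base64 audio string a server action returns. atob is available
// in browsers and in Node 16+, so this helper works on either side.
export function base64ToBytes(b64: string): Uint8Array {
  const bin = atob(b64); // base64 -> binary string
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}

// In a "use client" component you would then do something like:
//
//   const b64 = await speakAction(articleText);   // hypothetical server action
//   const url = URL.createObjectURL(
//     new Blob([base64ToBytes(b64)], { type: "audio/wav" })
//   );
//   audioRef.current!.src = url;                  // <audio ref={audioRef} controls />
```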

AI engineer. Building a voice agent that reads CRM notes aloud used to need two inference passes (LLM then TTS) plus manual speaker diarization. Flash TTS accepts multi-speaker transcripts inline: your LLM emits Joe: ... Jane: ... and TTS returns one WAV with two distinct voices. Your agent graph loses a node, latency drops by a full network round-trip, and you skip maintaining a separate speaker-labelling prompt that drifts every model upgrade.
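A sketch of how that single request might be assembled, assuming the preview keeps the multiSpeakerVoiceConfig request shape of earlier Gemini TTS releases — the field names and voice names (Kore, Puck) below are carried over from those, not confirmed for this model:

```typescript
// Map each transcript speaker to a prebuilt voice. Field names follow the
// multiSpeakerVoiceConfig shape of earlier Gemini TTS previews -- treat them
// as an assumption until checked against the current API reference.
type SpeakerVoice = { speaker: string; voiceName: string };

export function buildMultiSpeakerConfig(voices: SpeakerVoice[]) {
  return {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: voices.map((v) => ({
          speaker: v.speaker,
          voiceConfig: { prebuiltVoiceConfig: { voiceName: v.voiceName } },
        })),
      },
    },
  };
}

// The transcript itself is just labelled lines matching the speaker names:
export function buildTranscript(turns: { speaker: string; text: string }[]) {
  return turns.map((t) => `${t.speaker}: ${t.text}`).join("\n");
}
```

Because both builders are pure, the LLM node of the agent graph only has to emit `{ speaker, text }` turns and never touches TTS-specific config.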

Indie maker. A Duolingo-style pronunciation app priced at $4/month barely broke even on Azure TTS at roughly $0.05 per lesson. At $20 per million output audio tokens, a 30-second lesson now costs about $0.003 — gross margin holds above 90% on the same $4 tier. Voice is no longer the line item that kills your side project's unit economics, which means TTS-heavy features like audiobook summaries, podcast previews, or accessibility narration finally pencil out on a free-tier SaaS.
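The margin math is easy to sanity-check in code. The tokens-per-second rate below is an assumption reverse-engineered from the article's $0.003-per-30-second figure (150 tokens at $20/1M implies ~5 audio tokens per second), not a documented constant:

```typescript
// Back-of-envelope cost check. The $20/1M output-token price comes from the
// announcement; AUDIO_TOKENS_PER_SECOND is an illustrative assumption --
// verify it against real usageMetadata from your own calls.
const PRICE_PER_OUTPUT_TOKEN_USD = 20 / 1_000_000;
const AUDIO_TOKENS_PER_SECOND = 5; // assumed rate, see note above

export function estimateLessonCostUsd(
  seconds: number,
  tokensPerSecond = AUDIO_TOKENS_PER_SECOND,
): number {
  return seconds * tokensPerSecond * PRICE_PER_OUTPUT_TOKEN_USD;
}
```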

Hands-on: try it in under 15 minutes

Grab a free API key from aistudio.google.com, store it as GEMINI_API_KEY, then install the SDK:
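A minimal sketch of the server-side call, assuming the @google/genai SDK (`npm install @google/genai`) and that the preview follows the single-voice request shape of earlier Gemini TTS models — verify field names and the Kore voice against the current docs:

```typescript
// Server action sketch (e.g. app/actions/tts.ts). Keeping the payload builder
// pure makes it unit-testable without spending tokens on a live call.
const MODEL_ID = "gemini-3.1-flash-tts-preview";

export function buildTtsRequest(text: string, voiceName = "Kore") {
  return {
    model: MODEL_ID,
    contents: [{ parts: [{ text }] }],
    config: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName } },
      },
    },
  };
}

// With the SDK installed, the live call would look roughly like:
//
//   "use server";
//   import { GoogleGenAI } from "@google/genai";
//
//   const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
//   const res = await ai.models.generateContent(
//     buildTtsRequest("[excited] Welcome back to the show!")
//   );
//   // base64-encoded 24 kHz mono PCM:
//   const pcmB64 = res.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;
```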


The rest of this article is available on the original site.

Read the original →