Gemini Live API

Carry a voice, as a voice,
into 70+ languages.

Until now, real-time voice translation meant building a three-stage stack of recognition, translation, and synthesis yourself. Gemini Live API folds all of it into a single endpoint—voice to voice, with the speaker's tone preserved. We unpack how it works and where to use it, in diagrams.

AI Navigate Editorial·2026.06.10·6 min read
FIG. Pipeline: Recognize → Translate → Synthesize, with 1–3 sec lag at each stage and tone lost along the way. Live API: voice → voice in one model, continuous streaming translation, keeps tone, pace, and pitch.
01
The Old Way

The "weight" of a
three-stage pipeline

Until half a year ago, embedding real-time voice translation into your own app meant building a three-stage pipeline yourself: speech recognition → text translation → speech synthesis. You had to wire together separate services and models, align their input and output formats, and handle errors at each stage. Just standing it up took considerable effort.

What made it worse was latency. Each stage took one to three seconds, and stacked together they could not keep up with the rhythm of a conversation. The "pause" between someone finishing their sentence and the translation coming back chipped away at how natural the exchange felt. And because the design dropped everything to text along the way, the speaker's tone and emotion were lost right there.
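As a back-of-the-envelope check on those figures, the sketch below sums the per-stage delay for a sequential pipeline. It assumes the stages run strictly one after another and that each adds the 1–3 seconds quoted above; the function and constant names are illustrative only.

```python
# Back-of-the-envelope latency for a sequential three-stage pipeline.
# Assumption: stages run strictly one after another, each adding 1-3 s.

STAGES = ("recognize", "translate", "synthesize")
PER_STAGE_RANGE_S = (1.0, 3.0)  # per-stage delay range quoted in the article


def pipeline_latency_range(
    stage_count: int, per_stage: tuple[float, float]
) -> tuple[float, float]:
    """Return (best, worst) end-to-end delay when stage delays simply add up."""
    low, high = per_stage
    return stage_count * low, stage_count * high


best, worst = pipeline_latency_range(len(STAGES), PER_STAGE_RANGE_S)
print(f"end-to-end delay: {best:.0f}-{worst:.0f} s")  # prints "end-to-end delay: 3-9 s"
```

Under these assumptions, a listener waits three to nine seconds after the speaker finishes, which is why stacked stages cannot keep pace with a conversation.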

Three-stage pipeline (traditional) | Gemini Live API
Recognition, translation, and synthesis wired separately | Done within a single endpoint
1–3 sec lag accumulates at each stage | Continuous streaming generation
Tone disappears once routed through text | Preserves tone, pace, and pitch
Language support depends on the combination | Supports more than 70 languages

Don't drop it to text—
hand the voice over as a voice.


02
How It Works

Voice to voice, directly

The Live API removes the "relay point" of routing through text and generates a stream of translated audio directly from the stream of input audio.

FIG. Input audio → single model (voice → voice) → translated audio in the same tone. Without a text relay in between, translated audio is generated directly from the stream of input audio
01

Take the voice as it is

The audio stream coming in from the microphone is fed straight into the model without first being transcribed to text. Processing starts mid-utterance, so there's no need to wait until the speaker finishes.

02

Translate it as a voice

A single model grasps the meaning internally and outputs the translation continuously as audio. Because it never drops to text along the way, the result is speech in another language that preserves the speaker's tone, speed, and pitch.

03

Call it with one API

Developers touch just one endpoint. The wiring that stitched three services together and the latency management at each stage both disappear, sharply lowering the cost of building it in.
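The three steps above can be sketched as a streaming loop. This is a minimal offline simulation, not the Gemini SDK: `frame_audio`, `stream_translate`, and `fake_translate_frame` are hypothetical names, the 16 kHz 16-bit mono format and 100 ms frame size are assumptions, and the stand-in "model" simply echoes its input so the control flow can run without a network.

```python
from typing import Callable, Iterable, Iterator

SAMPLE_RATE_HZ = 16_000   # assumed input format: 16 kHz, 16-bit mono PCM
BYTES_PER_SAMPLE = 2
FRAME_MS = 100            # assumed frame size; tune for your latency budget


def frame_audio(pcm: bytes, frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Step 01: cut raw PCM into short frames so sending can start mid-utterance."""
    frame_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


def stream_translate(
    frames: Iterable[bytes],
    translate_frame: Callable[[bytes], bytes],
) -> Iterator[bytes]:
    """Steps 02-03: pass each frame to one callable and yield audio as it comes back.

    `translate_frame` is a hypothetical stand-in for the single endpoint.
    """
    for frame in frames:
        out = translate_frame(frame)
        if out:  # a real model may emit nothing for some input frames
            yield out


def fake_translate_frame(frame: bytes) -> bytes:
    """Offline placeholder: returns the input unchanged instead of translated audio."""
    return frame


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
translated = list(
    stream_translate(frame_audio(one_second_of_silence), fake_translate_frame)
)
print(len(translated), "frames,", sum(len(f) for f in translated), "bytes")
# prints "10 frames, 32000 bytes"
```

In a real integration, the placeholder would be replaced by the provider's streaming session; the actual call names, audio formats, and frame sizes should be taken from the official documentation rather than from this sketch.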

03
What Shipped

Opened straight up
to developers

Google released "Gemini 3.5 Live Translate" to developers as the Gemini Live API. A new use case—simultaneous translation—joins the frontier model lineup.

FIG. With one endpoint at the center (Live API), it extends to diverse apps (conference SaaS, language-learning apps, support automation) and to more than 70 target languages, with continuous translation that keeps the speaker's tone
70+ languages supported
1 endpoint needed
Voice → voice, done in a single model

At launch, it also shipped built into Google Meet and Google Translate. In other words, technology that Google itself runs in production across its own products is now in developers' hands as an API.

Achieving "real-time voice translation" used to be a research-project-grade challenge in its own right. Now it is within reach once you center your design on a single endpoint. This is the moment the barrier to entry clearly dropped for products that combine voice with many languages.

04
In Practice

Who it helps, and how

Products that need voice and many languages at once now have an off-the-shelf component that didn't exist before.

Conference SaaS

Participants speak in their native language and hear the other person in theirs. Without relying on text captions, simultaneous interpretation that keeps the rhythm of the voice can be done entirely within the app.

Language-learning app

Convert a reference utterance into any supported language instantly, delivering it to learners with the nuances of pitch and pace intact. It also suits turning pronunciation and tone into teaching material.

Support automation

Roll out multilingual customer support while preserving the operator's tone. It becomes easier to widen the language coverage of your inquiry handling.


05
Caveats

What to check
before adopting it

It's an attractive option, but don't overtrust it. The main thing to assess is quality. How well it "translates while keeping the speaker's voice" can vary with the language pair, the audio environment, and how much specialized terminology is involved. Catalog figures are not enough: until you actually try it on your own use case, you should treat it as an unknown.

If it can replace your existing three-stage pipeline, it should improve both latency and implementation cost. That's exactly why you should run an accuracy check on your own language pairs and intended scenarios before putting it into production. If it passes, adopting it is a sound move.
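One way to structure that pre-production check is a small acceptance gate per language pair. The sketch below is an illustration under stated assumptions: the `Trial` record, the 1 to 5 rating scale, and both thresholds are hypothetical, and the ratings are expected to come from your own reviewers or metric of choice, not from the API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Trial:
    pair: str          # language pair label, e.g. "ja->en"
    rating: float      # hypothetical 1-5 quality rating from your own reviewers
    latency_s: float   # measured delay from end of speech to start of translated audio


def evaluate(
    trials: list[Trial], min_rating: float = 4.0, max_latency_s: float = 1.5
) -> dict[str, bool]:
    """Return pass/fail per language pair using mean rating and worst-case latency."""
    by_pair: dict[str, list[Trial]] = {}
    for t in trials:
        by_pair.setdefault(t.pair, []).append(t)
    return {
        pair: mean(t.rating for t in ts) >= min_rating
        and max(t.latency_s for t in ts) <= max_latency_s
        for pair, ts in by_pair.items()
    }


# Made-up sample data for illustration only.
trials = [
    Trial("ja->en", 4.5, 0.9),
    Trial("ja->en", 4.0, 1.2),
    Trial("ja->de", 3.0, 1.1),
    Trial("ja->de", 4.0, 2.4),
]
print(evaluate(trials))  # prints {'ja->en': True, 'ja->de': False}
```

Grouping by language pair matters because quality can differ from one pair to the next; it is also worth collecting trials in the noisy conditions and with the specialized vocabulary your product will actually face.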

AI Navigate — Daily Update · 2026.06.10