Building Real-Time AI Voice Agents with Google Gemini 3.1 Flash Live and VideoSDK

Dev.to / 3/31/2026


Key Points

  • Google launched Gemini 3.1 Flash Live Preview, positioning it as a higher-quality real-time audio/voice model for low-latency conversational experiences.
  • The model supports audio-to-audio (not speech-to-text-first) to keep interactions feeling faster and more natural, with added benefits like lower latency and stronger handling of background noise.
  • It can capture acoustic nuances (pitch, pace, tone), supports real-time multilingual conversations in 90+ languages, and offers longer conversation memory (about twice the previous generation).
  • Gemini 3.1 Flash Live Preview introduces mid-conversation tool use during live interactions, enabling real-time triggering of external APIs/functions/searches as the dialogue unfolds.
  • The article provides a step-by-step guide to building a working AI voice agent by connecting the model via VideoSDK’s Python SDK, starting with setting up a dedicated Python virtual environment.

Google just launched Gemini 3.1 Flash Live Preview, its most capable real-time voice and audio model yet. If you're building AI voice agents, conversational apps, or anything that needs low-latency audio intelligence, this model is a big deal. And with VideoSDK's Python SDK, plugging it into your app takes just a few minutes.

In this blog, we'll walk through what the new model can do, and then build a working voice agent step by step using VideoSDK.

What's New in Gemini 3.1 Flash Live Preview

Google describes this as its "highest-quality audio and voice model yet," and there are a few things that actually back that up.

It's built for real-time, audio-first experiences. Unlike models that convert speech to text and then process it, Gemini 3.1 Flash Live works audio-to-audio: it hears you and responds as audio, keeping the conversation feeling natural and fast.

Here's what stands out:

  • Lower latency than before. Compared to 2.5 Flash Native Audio, this model is noticeably faster. Fewer awkward pauses, snappier responses. That matters a lot when you're building voice agents, where delays break the experience.
  • It actually understands how you say things. The model picks up on acoustic nuances such as pitch, pace, and tone, so it can tell when you're asking a casual question vs. when you sound urgent or confused.
  • Better background noise handling. It filters out noise more effectively, which means it works in real environments, not just quiet studios.
  • Multilingual out of the box. Over 90 languages supported for real-time conversations.
  • Longer conversation memory. It can follow the thread of a conversation for twice as long as the previous generation. So your agent won't "forget" what was said earlier in a long session.
  • Tool use during live conversations. This one is huge for agent builders. The model can now trigger external tools (APIs, functions, searches) while a live conversation is happening, not just at the end of a turn.
  • Multimodal awareness. It handles audio and video inputs together, so you can build agents that respond to what they see and hear at the same time.
  • The model ID is: gemini-3.1-flash-live-preview
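To make the mid-conversation tool-use idea concrete, here's a minimal, framework-agnostic sketch of an agent loop dispatching tool calls the moment they arrive in a live event stream, rather than waiting for the turn to end. The ToolCall structure, the tool names, and the event stream are hypothetical illustrations, not part of the Gemini or VideoSDK APIs:

```python
from dataclasses import dataclass

# Hypothetical structure for a tool call the model emits mid-conversation.
@dataclass
class ToolCall:
    name: str
    args: dict

# Registry of tools the agent can trigger while audio is still streaming.
TOOLS = {
    "get_weather": lambda args: f"Sunny in {args['city']}",
    "search_docs": lambda args: f"Top result for '{args['query']}'",
}

def dispatch(call: ToolCall) -> str:
    """Run a tool call as soon as it arrives, without waiting for the
    conversational turn to finish."""
    handler = TOOLS.get(call.name)
    if handler is None:
        return f"Unknown tool: {call.name}"
    return handler(call.args)

# Simulated live stream: audio chunks interleaved with tool calls.
events = [
    "audio-chunk",
    ToolCall("get_weather", {"city": "Berlin"}),
    "audio-chunk",
    ToolCall("search_docs", {"query": "VideoSDK agents"}),
]

results = [dispatch(e) for e in events if isinstance(e, ToolCall)]
print(results)  # tool results produced while audio is still flowing
```

In a real agent the results would be fed back into the live session so the model can speak them; the point here is simply that dispatch runs per event, not per turn.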

Building a Voice Agent with VideoSDK

VideoSDK gives you everything you need to wire Gemini 3.1 Flash Live into a real voice application. Here's how to get set up from scratch.

Step 1: Create and Activate a Python Virtual Environment

First, create a clean Python environment so your project dependencies stay isolated.

python3 -m venv venv

Activate it:

macOS/Linux

source venv/bin/activate

Windows

venv\Scripts\activate

You should see (venv) in your terminal, which means you're good to go.

Step 2: Set Up Your Environment Variables

Create a .env file in your project root and add your API keys:

VIDEOSDK_AUTH_TOKEN=your_videosdk_token_here
GOOGLE_API_KEY=your_google_api_key_here

You can get your VideoSDK auth token from the VideoSDK dashboard and your Google API key from Google AI Studio.

Important: when GOOGLE_API_KEY is set in your .env file, do not pass api_key as a parameter in your code; the SDK picks it up automatically.
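Before moving on, it can save a debugging round-trip to confirm both keys are actually visible to your process. Here's a stdlib-only sketch that parses a simple .env file and reports which of the two variables from this step are missing (the helper names are my own; many projects use the python-dotenv package for this instead):

```python
import os
from pathlib import Path

REQUIRED = ("VIDEOSDK_AUTH_TOKEN", "GOOGLE_API_KEY")

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=value lines from a .env file into os.environ.
    Blank lines and lines starting with '#' are ignored."""
    env = {}
    p = Path(path)
    if p.exists():
        for line in p.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
            # Don't clobber values already exported in the shell.
            os.environ.setdefault(key.strip(), value.strip())
    return env

def missing_keys() -> list:
    """Return the required variable names that are still unset."""
    return [k for k in REQUIRED if not os.environ.get(k)]

if __name__ == "__main__":
    load_env()
    missing = missing_keys()
    print("All keys set" if not missing else f"Missing: {missing}")
```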

Step 3: Install the Required Packages

Install VideoSDK's agents SDK along with the Google plugin:

pip install "videosdk-agents[google]"

Step 4: Create Your Agent (main.py)

Create a file called main.py in your project folder and paste in the following code:

from videosdk.agents import Agent, AgentSession, Pipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
import logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler()],
)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are VideoSDK's voice agent, a helpful voice assistant that can answer questions and help with tasks.",
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello, how can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    model = GeminiRealtime(
        model="gemini-3.1-flash-live-preview",
        # When GOOGLE_API_KEY is set in .env - DON'T pass api_key parameter
        # api_key="AIXXXXXXXXXXXXXXXXXXXX", 
        config=GeminiLiveConfig(
            voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr.
            response_modalities=["AUDIO"]
        )
    )

    pipeline = Pipeline(llm=model)
    session = AgentSession(
        agent=agent,
        pipeline=pipeline
    )

    await session.start(wait_for_participant=True, run_until_shutdown=True)

def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="<room_id>", # Replace it with your actual room_id
        name="Gemini Realtime Agent",
        playground=True,
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

To run the agent:

python main.py

Once you run this command, a playground URL will appear in your terminal. You can use this URL to interact with your AI agent.

What Can You Build With This?

Gemini 3.1 Flash Live + VideoSDK opens up a pretty wide range of real-world use cases:

  • Customer support voice bots. Replace or supplement your call center with agents that actually understand tone and can handle multilingual customers in real time.
  • AI meeting assistants. Agents that join calls, take notes, answer questions from participants, and trigger follow-up actions mid-conversation.
  • Healthcare intake agents. Voice-based triage agents that collect patient information, ask follow-up questions, and route to the right department, all in a natural spoken conversation.
  • Language tutors. Real-time conversation partners that catch pronunciation issues, adjust their pace based on the learner, and respond naturally.
  • Voice-controlled IoT and home automation. Agents that listen continuously, understand context, and trigger device actions through tool use, all with sub-second response times.
  • Live interview prep tools. Candidates practice answering questions aloud and get spoken feedback instantly.

Conclusion

Gemini 3.1 Flash Live Preview is a meaningful step forward for real-time voice AI. The improvements in latency, noise handling, multilingual support, and especially live tool use make it a strong foundation for production voice agents.

VideoSDK wraps all of that into a clean Python SDK that gets you from zero to a running agent in a handful of lines. Whether you're prototyping or building something you intend to ship, the setup here gives you a solid starting point.

Next Steps and Resources