I Built a Voice AI with Sub-500ms Latency. Here's the Echo Cancellation Problem Nobody Talks About

Dev.to / 4/5/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author built GoNoGo.team as a true speech-to-speech voice AI system using the Gemini 2.5 Flash Live API, avoiding an STT→LLM→TTS text pipeline.
  • The biggest engineering challenge turned out not to be multi-agent reasoning or orchestration, but preventing the AI from “hearing itself” through echo cancellation so it doesn’t interrupt its own speech.
  • Because the system streams raw PCM audio over WebSockets (16kHz mic input and 24kHz agent output) rather than working with text buffers, client-side audio handling and latency management become central.
  • Practical implementation details include chunking browser mic audio into ~32ms frames (e.g., 512-sample chunks) and using RMS-based analysis for VAD to manage when audio is sent upstream.

When I started building GoNoGo.team — a platform where AI agents interview founders by voice to validate startup ideas — I thought the hard part would be the AI reasoning. The multi-agent orchestration. The 40+ function-calling tools.

I was wrong.

The hard part was echo. Specifically: how do you stop an AI agent from hearing itself talk, freaking out, and interrupting its own sentence?

After 500+ voice sessions and too many late nights staring at RMS waveforms, here's what I actually learned.

The Setup: Speech-to-Speech, Not STT → LLM → TTS

GoNoGo runs on Gemini 2.5 Flash Live API — a true speech-to-speech pipeline. There's no intermediate transcription step, no text-to-speech synthesis layer bolted on afterward. Audio goes in, audio comes out. Direct.

This is important because it changes everything about how you handle audio on the client. You're not working with text buffers. You're working with raw PCM, 16kHz input from the browser mic, 24kHz output from the agent voice. Base64-encoded over WebSocket.

The browser capture side looks roughly like this:

// ScriptProcessorNode in browser: 512-sample chunks (~32ms each at 16kHz)
const VAD_THRESHOLD = 0.03; // normal-state threshold; the cooldown gate below raises it
const scriptProcessor = audioContext.createScriptProcessor(512, 1, 1);

scriptProcessor.onaudioprocess = (event) => {
  const inputBuffer = event.inputBuffer.getChannelData(0);

  // Calculate RMS for VAD
  const rms = Math.sqrt(
    inputBuffer.reduce((sum, sample) => sum + sample * sample, 0) / inputBuffer.length
  );

  // Below the VAD threshold: treat the chunk as silence and drop it
  if (rms < VAD_THRESHOLD) return;

  // Convert Float32 PCM to Int16
  const int16Buffer = new Int16Array(inputBuffer.length);
  for (let i = 0; i < inputBuffer.length; i++) {
    int16Buffer[i] = Math.max(-32768, Math.min(32767, inputBuffer[i] * 32768));
  }

  // Base64 encode and send over WebSocket. Build the byte string in a loop:
  // spreading a large typed array into String.fromCharCode can overflow the call stack.
  let byteString = '';
  for (const b of new Uint8Array(int16Buffer.buffer)) byteString += String.fromCharCode(b);
  const base64Audio = btoa(byteString);
  ws.send(JSON.stringify({ type: 'audio_chunk', data: base64Audio }));
};

Simple enough. Until the AI starts talking.

The Echo Problem (And Why Browser AEC Isn't Enough)

Browsers have built-in acoustic echo cancellation. You enable it when you call getUserMedia:

const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true
  }
});

This works great for video calls between humans. It was designed for that. But it has a fundamental assumption baked in: the "far end" audio is coming through a playback path the browser knows about, such as an `<audio>` element playing a remote stream.

When you're playing 24kHz PCM chunks from a WebSocket, decoded manually and scheduled through AudioContext buffers? The browser's AEC has no idea that audio exists. It can't cancel what it can't see.
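Concretely, the playback path looks something like this. It's a sketch with my own helper names, not the production code: decode the base64 Int16 PCM into floats, then schedule each chunk back-to-back on the AudioContext.

```typescript
// Decode a base64-encoded Int16 PCM chunk into Float32 samples in [-1, 1].
function decodePcmChunk(base64: string): Float32Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const int16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;
  return float32;
}

// Copy the samples into an AudioBuffer and queue it so chunks play gaplessly.
// Nothing in this path is visible to the browser's echo canceller.
function scheduleChunk(ctx: AudioContext, samples: Float32Array, startAt: number): number {
  const buffer = ctx.createBuffer(1, samples.length, 24000); // 24kHz agent output
  buffer.getChannelData(0).set(samples);
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  const when = Math.max(startAt, ctx.currentTime);
  source.start(when);
  return when + buffer.duration; // pass back as the next chunk's startAt
}
```

Each incoming WebSocket message runs through `decodePcmChunk`, and the returned timestamp threads through successive `scheduleChunk` calls so playback neither overlaps nor gaps.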

So your AI agent starts speaking. The microphone picks up the speaker output. The agent hears itself. In the best case, it gets confused and repeats something. In the worst case — and this happened constantly in early builds — you get a feedback loop where the agent interrupts itself mid-sentence, hears the interruption, tries to respond to it, hears that, and the whole session collapses.

I called these 1011 disconnects, after the WebSocket close code (1011, "internal error") that kept showing up in my logs.

The Two-Tier RMS Gate

The fix is a two-tier RMS (Root Mean Square) gate on the audio capture side. The idea is simple: measure the loudness of what the mic is picking up, and if it's probably just the speaker playing back, don't send it.

But "simple" hides a lot of edge cases.

Tier 1: Hard suppress during agent speech

While the agent is actively speaking, I track that state server-side and send it to the client. During this window, incoming audio is suppressed entirely — no chunks sent to Gemini.

let agentSpeaking = false;
let cooldownTimer: ReturnType<typeof setTimeout> | null = null;
const COOLDOWN_MS = 1500;
const COOLDOWN_THRESHOLD = 0.05; // Higher threshold during cooldown
const NORMAL_THRESHOLD = 0.03;   // Normal VAD threshold

// Called when agent audio stream starts/stops
function setAgentSpeakingState(speaking: boolean) {
  if (speaking) {
    agentSpeaking = true;
    if (cooldownTimer) clearTimeout(cooldownTimer);
  } else {
    agentSpeaking = false;
    // Start cooldown period
    cooldownTimer = setTimeout(() => {
      cooldownTimer = null;
    }, COOLDOWN_MS);
  }
}

function shouldSendAudioChunk(rms: number): boolean {
  if (agentSpeaking) return false; // Hard suppress

  if (cooldownTimer !== null) {
    // In cooldown: use higher threshold
    return rms > COOLDOWN_THRESHOLD;
  }

  return rms > NORMAL_THRESHOLD;
}

Tier 2: The 1.5-second cooldown

This is the part that took me longest to figure out. When the agent stops talking, there's still speaker resonance in the room. The RMS of captured audio doesn't drop to zero immediately; it decays. Background noise in a typical home office sits at 0.01–0.02 RMS, but for 1–2 seconds after playback stops you'll see 0.025–0.04 RMS, right around or above the normal VAD threshold.

The cooldown period raises the threshold (0.05 instead of the normal 0.03) for 1.5 seconds after agent speech ends. This catches the decay without cutting off a founder who starts talking back immediately.

Was this threshold tuned empirically? Absolutely. I spent days listening to session replays measuring exactly how fast room resonance decays in different mic setups.
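If it helps, the whole gate collapses into a small self-contained class with an injectable clock, which also makes the thresholds unit-testable. The class and method names here are mine, not from the production code:

```typescript
// Two-tier RMS gate: hard suppress while the agent speaks, elevated
// threshold during a cooldown window after it stops.
class TwoTierGate {
  private agentSpeaking = false;
  private cooldownUntil = 0; // timestamp (ms) when the cooldown window expires

  constructor(
    private readonly now: () => number = Date.now,
    private readonly cooldownMs = 1500,
    private readonly normalThreshold = 0.03,
    private readonly cooldownThreshold = 0.05,
  ) {}

  setAgentSpeaking(speaking: boolean): void {
    if (!speaking && this.agentSpeaking) {
      // Agent just stopped: start the cooldown window
      this.cooldownUntil = this.now() + this.cooldownMs;
    }
    this.agentSpeaking = speaking;
  }

  shouldSend(rms: number): boolean {
    if (this.agentSpeaking) return false; // Tier 1: hard suppress
    if (this.now() < this.cooldownUntil) {
      return rms > this.cooldownThreshold; // Tier 2: elevated threshold
    }
    return rms > this.normalThreshold;
  }
}
```

The injectable `now` is purely for testing; in the browser you'd construct it with the default `Date.now`.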

Session Resumption: The Other Half of the Problem

Echo cancellation solved the quality problem. Session resumption solved the reliability problem.

Gemini Live sessions drop. Network hiccups, mobile handoffs, Chrome deciding to do something aggressive with memory — connections fail. Early on, a dropped connection meant starting the entire 30-minute interview over. Founders would ragequit. I would understand completely.

The fix: store session handles in Firestore and resume on reconnect.

# FastAPI backend — session management
from google.genai.live import AsyncSession
from firebase_admin import firestore

async def get_or_create_session(
    project_id: str, 
    user_id: str
) -> tuple[AsyncSession, bool]:
    db = firestore.client()
    session_ref = db.collection('sessions').document(f'{user_id}_{project_id}')
    session_doc = session_ref.get()

    if session_doc.exists:
        session_data = session_doc.to_dict()
        handle = session_data.get('resumption_handle')

        if handle:
            try:
                # Attempt resume — Gemini picks up exactly where it left off
                session = await resume_gemini_session(handle)
                return session, True  # resumed=True
            except Exception:
                pass  # Fall through to new session

    # Create new session
    session = await create_gemini_session(project_id)
    session_ref.set({
        'created_at': firestore.SERVER_TIMESTAMP,
        'project_id': project_id
    })
    return session, False  # resumed=False

async def store_resumption_handle(user_id: str, project_id: str, handle: str):
    db = firestore.client()
    session_ref = db.collection('sessions').document(f'{user_id}_{project_id}')
    session_ref.update({'resumption_handle': handle})

When a session resumes, Gemini restores full context — every tool call result, every piece of market research, every persona in the synthetic focus group. The founder reconnects and the agent says "Sorry about that, where were we?" and genuinely knows where you were.
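Client-side, the other half is unglamorous: keep reconnecting and let the backend resume from the stored handle. A sketch with made-up constants (the production retry policy may differ):

```typescript
// Exponential backoff with a cap: 500ms, 1s, 2s, 4s, then 8s from there on.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Skeleton reconnect loop: the server looks up the Firestore handle and
// resumes the Gemini session, so the client only has to keep trying.
async function reconnectLoop(connect: () => Promise<void>, maxAttempts = 10): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return; // connected; context is restored server-side
    } catch {
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw new Error("could not reconnect");
}
```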

The Filler Audio Problem

One more thing nobody talks about: what do you play while the AI is thinking?

Gemini 2.5 Flash is fast: 300-500ms end-to-end. But when the agent is executing a tool call (crawling a competitor site with Playwright, scraping Reddit, calculating unit economics), you can hit 3-8 second gaps.

Silence in a voice conversation feels broken. Users assume the connection dropped.

Solution: pre-computed filler audio. Short phrases like "one moment please" or "let me look that up" in 17 languages, stored as PCM chunks and played whenever tool execution exceeds ~800ms. The filler is triggered via a text signal rather than proactive_audio; that feature had a regression causing double playback, so I disabled it entirely and stuck with text triggers.

This sounds trivial. It removed about 40% of "the app is broken" support messages.

What I'd Do Differently

  1. Start with the echo gate, not the AI logic. I spent weeks building beautiful multi-agent orchestration before I could demo it reliably. Wrong order.

  2. Instrument RMS values from day one. Log them. Every session. You can't tune what you can't see.

  3. Test on bad hardware. My dev setup has a good mic with physical distance from speakers. Most users have laptop mics 30cm from laptop speakers. Build for that.

  4. Mobile is a different planet. iOS Safari handles AudioContext lifecycle in ways that will make you question your career choices. But that's an article for another day.
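Point 2 on that list can be as simple as a rolling histogram flushed into the logs every few seconds. A sketch with arbitrary bucket sizes:

```typescript
// Rolling RMS histogram: cheap enough to update on every 32ms chunk,
// and it turns threshold tuning into an empirical exercise instead of guesswork.
class RmsHistogram {
  private readonly counts: number[];

  constructor(private readonly bucketWidth = 0.01, buckets = 20) {
    this.counts = new Array(buckets).fill(0);
  }

  record(rms: number): void {
    const i = Math.min(this.counts.length - 1, Math.floor(rms / this.bucketWidth));
    this.counts[i]++;
  }

  // One line per non-empty bucket, e.g. "0.03-0.04: 112"
  summary(): string[] {
    return this.counts
      .map((n, i) => ({ n, i }))
      .filter(({ n }) => n > 0)
      .map(({ n, i }) =>
        `${(i * this.bucketWidth).toFixed(2)}-${((i + 1) * this.bucketWidth).toFixed(2)}: ${n}`
      );
  }
}
```

Dumping `summary()` at the end of each session makes it obvious where background noise, echo decay, and real speech actually sit on each user's hardware.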

The Result

After solving these problems — the two-tier RMS gate, the 1.5s cooldown, the session resumption, the filler audio — GoNoGo runs 15-45 minute voice sessions with real founders, across 21 languages, with 3 AI agents handing off to each other mid-conversation. The 1011 disconnects essentially disappeared.

The voice infrastructure became invisible, which is exactly what it should be.

If you're building anything with browser mic + real-time AI audio: what's been your biggest challenge? I'm genuinely curious whether the echo problem is universal or whether I was doing something particularly wrong early on. Drop it in the comments.