# Switching my AI voice agent from WebSocket to WebRTC — what broke and what I learned
A few weeks ago, I came across Darryl Ruggles' blog post and accompanying repo for a bidirectional voice agent built with Strands BidiAgent and Amazon Nova Sonic v2. His work is remarkably well put together — I had a working voice assistant running on my laptop in about 10 minutes. The agent listens to your voice, searches a recipe knowledge base, sets cooking timers, looks up nutrition data, and converts units, all through natural conversation.
Darryl's version uses WebSocket as the transport between the browser and the agent. It works well, but I wanted to push things further: switch the transport to WebRTC, and deploy the whole thing on Bedrock AgentCore Runtime. This post covers that journey — what changed, what broke, and what I learned along the way.
But first, a short demo!
The full source code is available on GitHub. The repo is Terraform-managed end-to-end, though you can still use Darryl's Makefile approach if you prefer keeping Terraform for surrounding infrastructure and CLI calls for agent deployment.
## Why WebRTC for a voice agent
The WebSocket version works, so why change it? A few reasons pushed me toward WebRTC.
First, latency. WebSocket runs over TCP, which means every packet is guaranteed to arrive in order. That's great for chat messages, but for real-time audio, a single dropped packet causes the entire stream to stall while TCP retransmits. WebRTC[^1] uses UDP under the hood — if a packet is lost, the stream keeps going. For a voice conversation, a tiny glitch is far better than a noticeable pause.
Second, the browser does more of the heavy lifting. With WebSocket, I had to capture microphone audio using getUserMedia, downsample it to 16kHz with a ScriptProcessorNode, encode it as base64 PCM, and send it as JSON messages. On the playback side, I needed an AudioWorklet with a ring buffer to handle the incoming audio stream. With WebRTC, the browser handles audio capture, encoding (Opus), and playback natively through RTCPeerConnection. The frontend code got significantly simpler.
Third, WebRTC is future-proof for video. AI avatars are getting close to acceptable latency, and WebRTC handles video tracks just as naturally as audio tracks. Adding a video stream later is just a matter of adding a track to the existing peer connection — no architectural change needed.
## A quick primer on WebRTC architectures
There are two fundamentally different ways to use WebRTC, and the choice matters when building a voice agent.
### Peer-to-peer (P2P)
In P2P WebRTC, two peers connect directly to each other. There's no media server in the middle — audio flows straight from the browser to the agent and back. A TURN[^2] relay server may be needed when one or both peers are behind NAT[^3], which is almost always the case in production: clients sit behind internet routers, and agents need to live in a private VPC to reach company tools. The TURN server just forwards packets without inspecting or processing them.
### Room-based (SFU)
In a room-based architecture, a media server (called an SFU[^4] — Selective Forwarding Unit) sits in the middle. Participants connect to the server, not to each other. The server receives audio/video tracks from each participant and selectively forwards them to the others. LiveKit, Amazon Chime SDK, and Daily are examples of SFU-based platforms.
For a 1:1 voice agent, P2P is simpler and avoids the cost and complexity of running (or paying for) a media server. I went with P2P using Amazon Kinesis Video Streams (KVS) as the managed TURN relay — this is the documented approach for WebRTC on AgentCore.
I did consider room-based solutions, but each SFU platform requires its own SDK — you can't just connect with a standard RTCPeerConnection. AWS's own WebRTC offering, Amazon Chime SDK, is feature-rich (transcription, recording, analytics) and significantly cheaper than alternatives like LiveKit or Daily, but it doesn't yet offer a paved path for server-side agent-to-room communication. That's a feature I'd love to see, given how compelling the rest of the Chime SDK is. For now, P2P with KVS TURN was the most straightforward path. I'll definitely consider in-room WebRTC, but that's a story for another post.
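The KVS TURN credentials are short-lived and come back from the Kinesis Video Streams Signaling API in its own response shape. Here's a small sketch of adapting that shape to the `{urls, username, credential}` dicts both the browser's `RTCPeerConnection` and aiortc accept — the function name is mine, not from the repo:

```python
def to_rtc_ice_servers(ice_server_list):
    """Map the KVS GetIceServerConfig response entries onto the
    {urls, username, credential} dicts WebRTC stacks expect."""
    return [
        {
            "urls": server["Uris"],
            "username": server["Username"],
            "credential": server["Password"],
        }
        for server in ice_server_list
    ]

# In deployed mode the list would come from KVS, roughly:
#   signaling = boto3.client("kinesis-video-signaling", endpoint_url=endpoint)
#   resp = signaling.get_ice_server_config(ChannelARN=channel_arn)
#   ice_servers = to_rtc_ice_servers(resp["IceServerList"])
```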
## The WebRTC stack: browser and server
On the browser side, WebRTC is built in. The RTCPeerConnection API is available natively in every modern browser — Chrome, Safari, Firefox, Edge. You create a peer connection, add a microphone track via getUserMedia, and the browser handles audio encoding (Opus), ICE candidate gathering, and DTLS encryption. No libraries needed.
On the server side, it's a different story. WebRTC was designed for browsers, not for Python backends. The go-to library for server-side WebRTC in Python is aiortc — an asyncio-based implementation of WebRTC and ORTC. It handles peer connections, ICE negotiation, and media tracks, and uses PyAV (FFmpeg bindings) for audio/video frame processing. It's not as battle-tested as browser WebRTC, but it works well and is what the AWS sample code uses too.
## Architecture: local development vs. deployed
One thing I wanted to preserve from Darryl's original design is the ability to run everything locally for development, without any cloud infrastructure. The WebRTC migration maintains this.
### Local mode
In local mode, the agent runs on your machine. The browser and agent are on the same network (or the same machine), so WebRTC connects peer-to-peer without needing a TURN relay. Signaling — the exchange of SDP[^5] offers/answers and ICE[^6] candidates — goes through the Vite dev server proxy to the local FastAPI server.
### Deployed mode
In deployed mode, the agent runs inside a Docker container on Bedrock AgentCore Runtime, attached to a VPC via an elastic network interface (ENI) in a private subnet. The browser can't reach the agent directly — all media traffic flows through a KVS TURN relay. Signaling goes through AgentCore's /invocations HTTP endpoint, authenticated with SigV4 via the @aws-sdk/client-bedrock-agentcore SDK.
The following diagram from the AWS documentation shows how it works in terms of networking: signaling flows through AgentCore's HTTP endpoint while media traffic goes through the VPC's NAT gateway to the KVS TURN relay:
The important thing to note is that the agent code is almost identical between local and deployed modes. The BidiAgent, BidiNovaSonicModel, and all four tools (recipe search, timer, nutrition lookup, unit converter) are completely unchanged. The only difference is the transport layer: in local mode, aiortc connects P2P; in deployed mode, it connects through KVS TURN. The agent detects which mode it's in via the CONTAINER_ENV environment variable and configures ICE servers accordingly.
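The mode switch can be as small as one environment-variable check. A sketch, assuming the `CONTAINER_ENV` variable named above (the STUN URL and function name are illustrative, not from the repo):

```python
import os


def ice_servers_for_mode(turn_servers=None):
    """Return the ICE server list for the current deployment mode.

    CONTAINER_ENV is set inside the AgentCore container; turn_servers
    would be the relay entries fetched from KVS in deployed mode.
    """
    if os.environ.get("CONTAINER_ENV"):
        # Deployed: host candidates are VPC-internal and unreachable from
        # the browser, so only the KVS TURN relays are worth offering.
        return turn_servers or []
    # Local: browser and agent share a network, so plain P2P works; a
    # public STUN server helps if they're on different machines.
    return [{"urls": "stun:stun.l.google.com:19302"}]
```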
This clean separation was possible because of Strands' BidiInput/BidiOutput protocol. I wrote two small adapter classes — WebRTCBidiInput and WebRTCBidiOutput — that bridge aiortc audio tracks to the event format BidiAgent expects. The agent doesn't know or care whether audio is coming from a WebSocket or a WebRTC track.
## What Bedrock AgentCore's WebRTC support adds
On March 20th, 2026, AWS announced WebRTC support for AgentCore Runtime.
I'm not 100% sure, and am ready to stand corrected, but my impression is that the building blocks — VPC network mode, KVS TURN, the /invocations HTTP endpoint — all existed before this announcement. VPC network mode has been available since AgentCore's general availability in October 2025. KVS TURN is a long-standing Kinesis Video Streams feature. And /invocations has always been the standard HTTP endpoint for AgentCore runtimes.
What the March 20th release adds, as far as I can tell, is official documentation, working sample code, and the explicit statement that WebRTC is a supported protocol on AgentCore Runtime. Before this, you could technically have assembled the same pieces yourself, but you'd be on your own — no docs, no samples, no guarantee it would keep working.
What AgentCore does provide is genuinely valuable: managed container hosting with auto-scaling, session isolation between concurrent users, built-in observability (CloudWatch logs, X-Ray traces), and no infrastructure to manage beyond the VPC. I didn't have to set up ECS, configure load balancers, or manage container orchestration.
That said, there's a fair amount of custom code involved. The WebRTC signaling (SDP exchange, ICE candidate management), the aiortc peer connection lifecycle, the audio track bridging to BidiAgent, and the KVS TURN credential management — all of that is application code that I wrote. AgentCore hosts and runs it, but doesn't abstract it away.
## Challenges and lessons learned
The migration from WebSocket to WebRTC started as a smooth ride (local mode worked on the first attempt!) but got considerably bumpier once I tried to run it on Bedrock AgentCore. Here's what tripped me up.
### VPC availability zone compatibility
AgentCore Runtime only supports specific availability zones. In us-east-1, only use1-az4 (us-east-1a), use1-az1 (us-east-1c), and use1-az2 (us-east-1d) are supported. I initially let Terraform pick the first two AZs automatically, which gave me us-east-1a and us-east-1b. The runtime update failed with a cryptic UPDATE_FAILED status. The actual error message — mentioning the unsupported AZ — was buried in the failureReason field of the API response, not surfaced in the Terraform error. I ended up hardcoding the supported AZs in my VPC module.
### Session affinity
This one cost me hours. WebRTC signaling is a multi-step handshake — the browser and agent exchange several messages to establish a connection. The agent needs to remember the connection state from the first message when processing the second and third. If those messages land on different server instances, the agent has no memory of the ongoing handshake and the connection fails.
I initially used raw SigV4-signed HTTP POST requests, assuming that including the session ID as a query parameter would provide routing affinity. It didn't. The ICE candidates were apparently landing on a different container instance than the one holding the peer connection.
The fix was to use the @aws-sdk/client-bedrock-agentcore SDK with InvokeAgentRuntimeCommand and the runtimeSessionId parameter. This is the only reliable way to ensure all requests for a WebRTC session reach the same container instance. The AWS sample code uses this pattern too — I just didn't notice it at first because I was focused on the WebRTC parts.
### SDP candidate filtering
When the agent creates a peer connection inside the VPC, aiortc generates ICE candidates for all available network interfaces — including VPC-internal IPs like 169.254.0.2. These host candidates end up in the SDP answer sent to the browser. The browser dutifully tries to connect to them, fails (because they're unreachable from the public internet), and only then falls back to the relay candidates. This adds several seconds to the connection time.
The fix is straightforward: strip non-relay candidates from the SDP answer before returning it to the browser. In deployed mode, the only candidates that can work are TURN relay candidates, so there's no reason to include the others.
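A minimal version of that filter, relying on the standard `a=candidate:` line format where the candidate type follows the literal `typ` token (the real code would also handle edge cases like `a=end-of-candidates`):

```python
def strip_non_relay_candidates(sdp: str) -> str:
    """Remove host/srflx ICE candidates from an SDP blob, keeping only
    TURN relay candidates — the only ones reachable from the browser
    when the agent sits in a private VPC."""
    kept = []
    for line in sdp.split("\r\n"):
        if line.startswith("a=candidate:"):
            tokens = line.split()
            cand_type = tokens[tokens.index("typ") + 1]
            if cand_type != "relay":
                continue  # drop host / server-reflexive candidates
        kept.append(line)
    return "\r\n".join(kept)
```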
### TURN-only mode
Similar to the SDP filtering issue, the agent's aiortc instance tries host candidates before relay candidates by default. Since host candidates use VPC-internal IPs that can never work from the browser's perspective, this wastes time. Configuring aiortc to only use TURN relay candidates (turn_only=True) skips straight to the candidates that actually work.
### Lazy KVS initialization
I initially called kvs.init() at module import time, guarded by an if IS_CONTAINER check. This worked fine locally but caused the container to crash on AgentCore. The KVS API call to find or create the signaling channel requires AWS credentials, and during container startup there can be a brief delay before the IAM role credentials are available. Moving the initialization to the first actual request (lazy init) fixed the crash.
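The fix is plain lazy initialization: defer the credential-dependent call until the first request instead of running it at import time. A generic sketch (the factory name in the usage comment is illustrative):

```python
class LazyInit:
    """Defer an expensive, credential-dependent setup call until first
    use, instead of running it at module import time."""

    def __init__(self, factory):
        self._factory = factory
        self._value = None

    def get(self):
        if self._value is None:
            # First request: by now the container's IAM role
            # credentials have been delivered.
            self._value = self._factory()
        return self._value


# Usage sketch:
#   kvs_channel = LazyInit(create_or_find_signaling_channel)
#   ... later, inside the request handler:
#   channel = kvs_channel.get()
```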
### Cold start behavior
After the container has been idle for a while, the first WebRTC connection attempt sometimes fails. The signaling requests succeed (AgentCore returns 200), but the ICE connection never completes. I suspect this is related to AgentCore spinning up a fresh container instance — the first few requests may be handled by an instance that isn't fully warmed up. On the agent side, I explicitly set --workers 1 in the uvicorn command to ensure all requests within a container hit the same process (and therefore the same in-memory peer connection state). On the frontend, I added a retry mechanism: wait for ICE to reach "connected" state, and if it doesn't within 10 seconds, tear down and retry with a new session ID. Together, these made the connection reliable.
## Key code
I won't walk through every file, but here are the pieces that make the WebRTC integration work.
The WebRTCBidiInput adapter reads audio frames from the aiortc track, resamples them to 16kHz, and returns them as bidi_audio_input events that BidiAgent understands:
```python
import base64

from aiortc.mediastreams import MediaStreamError
from av import AudioResampler

# Nova Sonic expects 16 kHz mono PCM
_resampler = AudioResampler(format="s16", layout="mono", rate=16000)


class WebRTCBidiInput:
    def __init__(self, track):
        self._track = track

    async def __call__(self):
        try:
            frame = await self._track.recv()
        except MediaStreamError:
            raise StopAsyncIteration
        resampled = _resampler.resample(frame)
        pcm = b"".join(bytes(f.planes[0]) for f in resampled)
        return {
            "type": "bidi_audio_input",
            "audio": base64.b64encode(pcm).decode("utf-8"),
            "sample_rate": 16000,
        }
```
The WebRTCBidiOutput adapter does the reverse — it receives events from BidiAgent and pushes audio to the aiortc output track:
```python
import base64


class WebRTCBidiOutput:
    def __init__(self, output_track):
        self._output_track = output_track

    async def __call__(self, event):
        if event.get("type") == "bidi_audio_stream":
            audio_bytes = base64.b64decode(event["audio"])
            self._output_track.add_audio(audio_bytes)
        elif event.get("type") == "bidi_interruption":
            # Barge-in: drop any queued audio so the agent stops talking
            self._output_track.clear()
```
On the frontend, the useWebRTCSession hook uses the AgentCore SDK for signaling:
```javascript
const invoke = async (action, data = {}) => {
  const client = new BedrockAgentCoreClient({ region, credentials });
  const resp = await client.send(new InvokeAgentRuntimeCommand({
    agentRuntimeArn,
    runtimeSessionId: sessionId, // ensures session affinity
    contentType: 'application/json',
    payload: new TextEncoder().encode(JSON.stringify({ action, data })),
  }));
  return JSON.parse(new TextDecoder().decode(
    await resp.response.transformToByteArray()
  ));
};
```
The full source is in the repo — the feat/webrtc branch has the local-only version, and feat/webrtc-agentcore has the full deployed version with Terraform.
## Development tooling
I built this project using Kiro CLI, Amazon's AI development assistant. It handled the planning, code generation, debugging, and iterative deployment — including the many rounds of trial-and-error with WebRTC configuration that this post describes. The back-and-forth between writing code, deploying, checking logs, and fixing issues was a natural fit for an AI pair-programming workflow.
## Try it yourself
To run locally:
```shell
git clone https://github.com/psantus/strands-bidir-nova.git
cd strands-bidir-nova
git checkout feat/webrtc
uv sync && make install-frontend

# Terminal 1:
make serve

# Terminal 2:
make serve-frontend
```
Open http://localhost:5173, click the microphone, and start talking.
For the deployed version on AgentCore, check out the feat/webrtc-agentcore branch and follow the README. You'll need a Bedrock Knowledge Base with some recipes, a Cognito user pool, and Docker for building the container image. A single terraform apply handles the rest.
If you'd rather start with the WebSocket version first, Darryl Ruggles' original post is the place to go.
Paul Santus is an independent cloud consultant at TerraCloud. He helps organizations build and deploy AI-powered applications on AWS. Connect with him on LinkedIn.
[^1]: WebRTC (Web Real-Time Communication) — An open standard for real-time audio, video, and data communication directly between browsers and devices, using UDP-based transport.

[^2]: TURN (Traversal Using Relays around NAT) — A relay server that forwards media traffic when two peers can't connect directly. Both sides send their audio to the TURN server, which relays it to the other side.

[^3]: NAT (Network Address Translation) — A networking mechanism that maps private IP addresses to public ones. Most home routers and cloud VPCs use NAT, which prevents direct inbound connections.

[^4]: SFU (Selective Forwarding Unit) — A media server that receives audio/video tracks from participants and selectively forwards them to others, without mixing or transcoding. Used by LiveKit, Chime SDK, Daily, etc.

[^5]: SDP (Session Description Protocol) — A text format describing a multimedia session: codecs, transport addresses, and media types. In WebRTC, peers exchange SDP "offers" and "answers" to negotiate the connection.

[^6]: ICE (Interactive Connectivity Establishment) — A protocol for finding the best network path between two peers. It gathers candidate addresses (local, server-reflexive, relay) and tests connectivity between them.








