Qwen3-TTS + qwen3.6-35B for a voice agent pipeline — 3 weeks of notes

Reddit r/LocalLLaMA / 4/24/2026


Key Points

  • The author describes building a local, voice-first assistant that uses Whisper → Qwen3.6 (LLM) → Qwen3-TTS, with the goal of producing conversational responses rather than “typing-like” pauses.
  • They report that Qwen3-TTS significantly improves expressiveness and intonation for short, back-and-forth phrases compared with Kokoro, and avoids the multi-second cold-start latency they sometimes saw with XTTS-v2.
  • On the LLM side, using Qwen3.6-35B-A3B is credited with “thinking preservation” across turns, helping multi-turn voice sessions maintain context instead of resetting each exchange.
  • The pipeline’s overall round-trip latency is described as workable for real-time conversation, though not instantaneous, and the author says it no longer feels like a broken pause mid-sentence.
  • A remaining challenge is handling tool calls inside the voice loop, where retrieval/tool execution creates a gap before TTS can start; the author is seeking approaches to stream partial text while waiting for tool results.

Saw the Qwen3-TTS thread this morning and it finally pushed me to write this up.

Background: I've been building a local voice assistant for a client over the past 3 weeks. It's a voice-first interface on top of a RAG backend -- the use case is an AI assistant that needs responses that feel conversational, not a typing test where you wait for the cursor to stop.

TTS was the weak link. Tried Kokoro first, which is solid for narration but gets flat on short phrases like "got it" or "sure, one sec" -- the kind of back and forth that dominates voice interfaces. XTTS-v2 was more expressive, but cold-start latency was sometimes 4-6 seconds depending on GPU state, which kills the flow.
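One cheap way to hide that kind of cold start is to pay it once at startup instead of on the first real request. A minimal sketch -- the `synthesize(text) -> bytes` interface here is hypothetical, not any specific library's API:

```python
class WarmTTS:
    """Wrap a TTS engine and absorb its cold start at construction time.

    `engine` is any object with a `synthesize(text) -> bytes` method
    (hypothetical interface, stand-in for whatever TTS backend you run).
    """

    def __init__(self, engine):
        self.engine = engine
        # Dummy synthesis at startup: the model loads / JIT-warms here,
        # so the first user-facing request doesn't eat the 4-6 s hit.
        self.engine.synthesize("warm up")

    def synthesize(self, text):
        return self.engine.synthesize(text)
```

This doesn't help if the GPU later evicts the model; for that you'd need a periodic keep-alive call on an idle timer.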

Swapped in Qwen3-TTS this past week and the difference is real. Expressiveness on question intonation improved noticeably. Proper nouns and acronyms are still a bit inconsistent, but for general conversation it doesn't feel robotic anymore -- first local TTS model where I've been able to just leave it running without the urge to swap something.

On the LLM side: [Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B). The thinking preservation across turns is what makes it actually work for voice sessions. Previous reasoning carries forward so multi-turn context compounds instead of resetting every time. Matters a lot when users reference something from 7 exchanges ago.

Full pipeline is whisper -> qwen3.6 -> qwen3-TTS. Round-trip latency is workable. Not instant, but it doesn't feel like a broken pause mid-sentence.
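Most of the perceived latency in a pipeline like this comes from waiting for the full LLM reply before TTS starts. A common trick is to stream LLM tokens and flush to TTS at sentence boundaries, so speech begins after the first sentence. A generic sketch, not tied to any particular client library:

```python
import re

# Sentence boundary: ., !, ?, or ellipsis, optionally followed by a
# closing quote/bracket, then whitespace.
SENTENCE_END = re.compile(r'[.!?…]["\')\]]?\s')

def sentence_chunks(token_stream):
    """Yield complete sentences from an incremental token stream so TTS
    can start speaking before the LLM finishes the whole reply.

    token_stream: any iterable of text fragments (e.g. streamed deltas
    from an LLM client -- the exact client API is up to you).
    """
    buf = ""
    for token in token_stream:
        buf += token
        while True:
            m = SENTENCE_END.search(buf)
            if not m:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()

# Usage (hypothetical calls):
# for sentence in sentence_chunks(llm.stream(prompt)):
#     tts.speak(sentence)
```

Tuning the boundary regex matters for voice: splitting too eagerly (e.g. on "Dr.") makes the TTS prosody choppy, splitting too late reintroduces the pause.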

One thing still unsolved: tool calls inside the voice loop. When the user asks something that needs a retrieval step, there's a gap before TTS can start. Haven't found a clean way to stream partial response text before the tool result comes back. If anyone's gotten that working, genuinely curious how.
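One approach I've seen for masking that gap (not claiming it's *the* answer): kick off the tool call concurrently and have the LLM-or-canned acknowledgment play while it runs. A minimal asyncio sketch where `speak` is a hypothetical async hook into TTS:

```python
import asyncio

async def answer_with_tool(tool_coro, speak,
                           filler="One sec, let me check."):
    """Speak an acknowledgment immediately, run the tool/retrieval call
    concurrently, then speak the real answer once the result arrives.

    tool_coro: awaitable producing the tool result.
    speak: async callable that sends text to TTS (hypothetical interface).
    """
    # Start the tool call first so it overlaps with the filler audio.
    task = asyncio.create_task(tool_coro)
    await speak(filler)          # covers part of the retrieval latency
    result = await task          # block only for whatever time remains
    await speak(f"Here's what I found: {result}")
```

It only hides latency up to the filler's duration; for longer tool calls you'd still need either progressive status updates or true partial-response streaming.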

submitted by /u/ecompanda