Heya guys and gals,

Around a year ago I released and posted about Persona Engine as a fun side project: getting the whole ASR -> LLM -> TTS pipeline running fully locally, with a real-time lip-synced avatar (think VTuber). I got it working and was super happy with the result, but the TTS was definitely lacking for me, since I was using Sesame at the time as a reference. After that I took a long break.

A week or two ago I decided to give the project a refresh and see how far local models have come since, and boy was I pleasantly surprised by Qwen3 TTS. In my initial tests it was lacking, especially the version published by the Qwen team themselves, but after digging around and experimenting a lot I was able to:
Once this was all done, I also decided to finetune my own Qwen3-TTS voice. The cloning capabilities are really cool, but cloned voices are lacking in contextual understanding and struggle with pronunciation. Additionally, the custom trained voices provided by the Qwen team didn't include any female native speakers, and I didn't want to create a new Live2D model. In the end, the finetune blew me away, and I'll probably keep improving it. GitHub is here: https://github.com/fagenorn/handcrafted-persona-engine Check it out, have fun, and let me know whatever crazy stuff you decide to do with it.
Qwen3 TTS is seriously underrated - I got it running locally in real-time and it's one of the most expressive open TTS models I've tried
Reddit r/LocalLLaMA / 4/23/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- A developer revisited a fully local ASR→LLM→TTS pipeline for a lip-synced “realtime avatar” (VTuber-like) and found Qwen3 TTS to be far more capable than earlier setups.
- They report that Qwen3 TTS streams reliably thanks to the model’s decoder design (sliding window), helping maintain consistent prosody, pitch, and intonation during streaming.
- They got the model running under llama.cpp via quantization, enabling local real-time performance from a C# workflow.
- Because Qwen3 TTS lacks the word-level timings and phoneme outputs of their previous TTS (Kokoro), they implemented CTC-based word-level alignment to drive subtitles and more accurate lip movement.
- After these integration steps, they fine-tuned their own Qwen3-TTS voice and say the fine-tuning significantly improved expressiveness and practicality for their use case.
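The CTC alignment step mentioned above can be sketched roughly: a CTC acoustic model emits one best token per audio frame, and collapsing repeated tokens and blanks yields per-word start/end frames that can drive subtitles and visemes. The vocabulary convention below (`_` as the CTC blank, `|` as a word separator, in the wav2vec2 style) and the helper function are illustrative assumptions, not the project's actual code.

```python
def ctc_word_timings(frame_tokens, blank="_", sep="|"):
    """Collapse per-frame CTC best-path tokens into (word, start_frame, end_frame).

    frame_tokens: one character per audio frame; `blank` is the CTC blank
    and `sep` marks a word boundary. Multiply frame indices by the model's
    frame hop (e.g. 20 ms) to get times in seconds.
    """
    words, cur, start = [], [], None
    prev = blank
    for i, tok in enumerate(frame_tokens):
        # CTC collapse rule: a token is a new emission only if it is not
        # the blank and differs from the previous frame's token.
        if tok != blank and tok != prev:
            if tok == sep:  # word boundary: flush the current word
                if cur:
                    words.append(("".join(cur), start, i))
                cur, start = [], None
            else:
                if start is None:
                    start = i  # first emitting frame of this word
                cur.append(tok)
        prev = tok
    if cur:  # flush the final word
        words.append(("".join(cur), start, len(frame_tokens)))
    return words

# Toy best-path frame sequence for "hello world":
frames = list("hh_e_ll_lo_|__w_oo_r_ld__")
print(ctc_word_timings(frames))
# → [('hello', 0, 11), ('world', 14, 25)]
```

In a real pipeline the per-frame tokens would come from an `argmax` over a CTC model's log-probabilities, and the resulting word spans would be scaled by the frame hop to timestamp subtitles and schedule mouth shapes.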


