Qwen3-TTS but in OpenVINO, from scratch

Reddit r/LocalLLaMA / 5/4/2026


Key Points

  • The author has released a codebase that implements Qwen3-TTS and converts it into OpenVINO IR format, sharing the work after merging it to OpenArc in March 2026.
  • Instead of relying on Transformers for core logic, they rebuilt the TTS pipeline in PyTorch from scratch to better understand how to design an OpenVINO-friendly conversion.
  • Their main engineering lesson is that conversion quality depends on analyzing data flow in the nn.Module and iteratively adjusting device placement so the OpenVINO compiler selects optimal kernels and fusions.
  • They note that custom kernels are a separate future task, and that stateful components like kv-cache are particularly difficult to get right without extensive guidance (even for AI assistance).
  • The repository currently targets the 1.7B model size for CPU and GPU in OpenArc, and invites community contributions for NPU support and potential benchmarking versus PyTorch.

Hello everyone,

I finally got around to preparing my implementation of Qwen3-TTS in OpenVINO format as a codebase. This work was done in early 2026 and merged to OpenArc in March, and I kept forgetting to release the code. Here we are. https://github.com/SearchSavior/Qwen3-TTS-OpenVINO

One guy from our Discord speaks Russian and I wanted to voice-clone Elmo on my A770, so I decided to reimplement Qwen3-TTS from scratch in PyTorch, ignoring Transformers (except for AutoTokenizer, my beloved), to really get inside how you design an OpenVINO conversion to their model format.

The key learning is: you take an nn.Module with some logic, study the data flow through its forward method, then iterate until you find the combination of data flow and device placement that lets the OpenVINO compiler choose the best kernels. Interfering with this process, i.e. writing custom kernels, is a totally separate mission for future work. There were a ton of steps in between, and a key learning for me in this project was taking better notes.

AI assistance was used... but honestly I'm not sure how it could be done without it. Even Opus 4.5 could not make good OpenVINO-flavored choices, especially around the stateful kv-cache, and could not anticipate kernel fusion without extensive guidance. Intel does not put enough effort into documenting their engineering practices... which makes OpenVINO feel not so open after all. BUT, with AI tools and some effort, it is possible.
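To illustrate why the kv-cache is the hard part: the common pattern (not necessarily exactly what this repo does) is to expose the cache as explicit tensor inputs/outputs before conversion, so the traced graph stays stateless; OpenVINO's stateful transformations can then fold those ports into internal state. A toy sketch of that interface:

```python
import torch

class CachedAttention(torch.nn.Module):
    """Toy single-head attention step with an explicit kv-cache interface.

    Passing past_k/past_v in as plain tensors and returning the updated
    cache keeps the graph stateless and traceable; making it stateful
    is then a conversion-time concern, not a modeling-time one.
    """
    def __init__(self, dim=64):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.dim = dim

    def forward(self, x, past_k, past_v):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        k = torch.cat([past_k, k], dim=1)   # grow the cache by one step
        v = torch.cat([past_v, v], dim=1)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.dim ** 0.5, dim=-1)
        return attn @ v, k, v               # return updated cache

model = CachedAttention().eval()
x = torch.randn(1, 1, 64)
k0 = torch.zeros(1, 0, 64)                  # empty cache at step 0
v0 = torch.zeros(1, 0, 64)
out, k1, v1 = model(x, k0, v0)
print(out.shape, k1.shape)
```

The dynamic cache dimension is exactly the kind of thing where device placement and compiler behavior interact badly if you get it wrong, which is why this needed so much hand-holding.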

This codebase can be generalized for optimizing any PyTorch model into OpenVINO IR format. I tried to make sure the code is easy to follow, but it is conceptually demanding: it draws on poorly documented OpenVINO concepts that Opus implemented from targeted upstream examples I was able to conjure from memory, with hours of testing on top. Though AI-assisted, this code was in no way full-send vibe-coded.

It's all live in OpenArc now, covering only the 1.7B size on CPUs and GPUs; I had issues with 0.6B that I did not investigate further. NPU support PRs are most welcome.

Unlike other implementation posts, I haven't included any benchmarks, mostly due to time constraints plus changes I made to the inference code in the OpenArc PR versus what's in this repo. If there is interest, we can bench OpenArc against PyTorch cpu/xpu.
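If that benchmarking happens, the harness is just wall-clock timing around generation; a pure-stdlib sketch, with a hypothetical `synthesize` callable standing in for either backend:

```python
import time
import statistics

def bench(synthesize, texts, warmup=2, runs=5):
    """Time a TTS callable over a batch of texts.

    `synthesize` is a placeholder for either the OpenArc/OpenVINO
    path or the PyTorch cpu/xpu path. Returns (mean, stdev) seconds
    per full pass over `texts`.
    """
    for t in texts[:warmup]:          # warm caches / JIT before timing
        synthesize(t)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        for t in texts:
            synthesize(t)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

# Dummy backend so the harness runs standalone.
mean, stdev = bench(lambda t: [0] * len(t), ["hello", "world"])
print(f"{mean:.4f}s ± {stdev:.4f}s")
```

Same harness, two backends, same texts: that is about all a fair OpenArc-vs-PyTorch comparison needs, plus pinning threads and device clocks.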

submitted by /u/Echo9Zulu-
