https://preview.redd.it/h1thbwyh0vpg1.png?width=780&format=png&auto=webp&s=ed003920197dad29320430777da1581a1d628f01
Hi everyone,
Sorry if this format isn't a good fit for Reddit; blogging is just my style. Maybe I should have posted it on another portal, I don't know.
So let's start with the backstory:
About two years ago I translated 19,784 World of Warcraft quests into Russian using voice cloning with local models. Recently I revived my YouTube channel and started posting stream highlights about programming. While experimenting, I re-voiced a Fireship video about OpenClaw, and that's where the idea evolved into something bigger: digital avatars and voice replacements.
So I started thinking…
Yes, I can watch videos in English just fine. But I still prefer localized voiceovers (like Vert Dider over original Veritasium). And then I thought — why not do this myself?
Right, because I’m too lazy to do it manually 😄
So instead, I automated a process that should take ~15 minutes… but I spent hours building tooling for it. Classic programmer logic.
This post is a translation of my post on Habr, a Russian alternative to Reddit (the link to the original post); sorry for my English anyway.
Final Result
Voicer (open-source): A tool that automates translation + voiceover using cloned voices.
I originally built it for myself, but wrapped it into a desktop app so others don’t have to deal with CLI if they don’t want to.
It runs locally via Ollama (or you can adapt it to LM Studio or anything else).
What It Does
- Desktop app (yeah, Python 😄)
- Integrated with Ollama
- Uses one model (I used translategemma:27b) to:
  - clean raw subtitles
  - adapt the text
  - translate into the target language
  - clean/adapt again for narration
- Uses another model (Qwen3-TTS) to:
  - generate speech from the translated text
  - mimic a reference voice
- Batch processing (by sentences)
- Custom pronunciation dictionary (stress control)
- Optional CLI (for automation / agents / pipelines)
How It Works (Simplified Pipeline)
- Extract subtitles
Download captions from YouTube (e.g. via downsub)
https://preview.redd.it/0jpjuvrivupg1.png?width=767&format=png&auto=webp&s=be5fcae7258c148a94f2e258a19531575be23a43
- Clean the text
https://preview.redd.it/pc8p8nmjvupg1.png?width=780&format=png&auto=webp&s=3729a24b1428a7666301033d9bc81c8007624002
Subtitles are messy — duplicates, broken phrasing, etc.
You can:
- clean manually
- use GPT
- or (like me) use local models
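If you'd rather not involve a model at all, the basic cleanup (dropping cue numbers, timestamps, and the duplicate lines that rolling captions produce) can be sketched in a few lines. This is a simplified illustration, not the tool's actual code:

```python
import re

def clean_srt(srt_text: str) -> str:
    """Strip SRT cue numbers/timestamps and drop consecutive duplicate lines."""
    timestamp = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> ")
    lines, prev = [], None
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip cue indices, timestamps, and blank separators
        if not line or line.isdigit() or timestamp.match(line):
            continue
        if line != prev:  # rolling captions repeat the previous line
            lines.append(line)
        prev = line
    return " ".join(lines)
```

Real subtitles are messier than this (partial-line repeats, mid-word breaks), which is exactly why handing the heavy lifting to a model is attractive.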
- 3-Step Translation Pipeline
I used a 3-stage prompting approach:
Clean broken English
You are a text editor working with YouTube transcripts. Clean the following transcript while preserving the original meaning.
Rules:
- Merge broken sentences caused by subtitle line breaks
- Remove duplicated words or fragments
- Fix punctuation
- Keep the original wording as much as possible
- Do not summarize or shorten the text
- Do not add commentary
Output only the cleaned English transcript.
Transcript:
Translate carefully
You are an expert translator and technical writer specializing in programming and software engineering content. Your task is to translate the following English transcript into natural Russian suitable for a YouTube tech video narration.
Important: This is a spoken video transcript.
Guidelines:
1. Preserve the meaning and technical information.
2. Do NOT translate literally.
3. Rewrite sentences so they sound natural in Russian.
4. Use clear, natural Russian with a slightly conversational tone.
5. Prefer shorter sentences suitable for narration.
6. Keep product names, libraries, commands, companies, and technologies in English.
7. Adapt jokes if necessary so they sound natural in Russian.
8. If a direct translation sounds unnatural, rewrite the sentence while preserving the meaning.
9. Do not add commentary or explanations.
Formatting rules:
- Output only the Russian translation
- Keep paragraph structure
- Make the result suitable for voice narration
Text to translate:
Adapt text for natural speech
You are editing a Russian translation of a programming YouTube video. Rewrite the text so it sounds more natural and fluid for voice narration.
Rules:
- Do not change the meaning
- Improve readability and flow
- Prefer shorter spoken sentences
- Make it sound like a developer explaining technology in a YouTube video
- Remove awkward phrasing
- Keep technical names in English
- Do not add explanations or commentary
Output only the final Russian narration script.
Text:
Prompts are simple, nothing fancy — just works.
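The three stages chain naturally, each stage's output feeding the next prompt. A minimal sketch, where `generate` is a stand-in for whatever LLM backend you use (Ollama, an OpenAI-compatible endpoint, etc.) and the templates are abbreviated:

```python
from typing import Callable

# Abbreviated versions of the three prompts above; only the {text} slot matters here.
STAGES = [
    "You are a text editor working with YouTube transcripts. Clean the following "
    "transcript. Output only the cleaned English transcript.\nTranscript:\n{text}",
    "You are an expert translator. Translate the following transcript into natural "
    "Russian. Output only the Russian translation.\nText to translate:\n{text}",
    "You are editing a Russian translation. Rewrite it for natural narration. "
    "Output only the final Russian narration script.\nText:\n{text}",
]

def run_pipeline(text: str, generate: Callable[[str], str]) -> str:
    """Feed each stage's output into the next stage's prompt."""
    for template in STAGES:
        text = generate(template.format(text=text))
    return text
```

Keeping `generate` as a plain callable also makes the chain easy to test without a model running.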
- Voice Generation
Of course, I needed an option to capture metrics, but it generally works without MLflow too. MLflow here is a tool that traces OpenAI-compatible calls so you can track token economics and so on.
- Uses translategemma (I found advice on Reddit recommending it)
- Requires:
- reference audio (voice sample)
- matching reference text
- Output: cloned voice speaking translated text
The CLI signature is the following:
poetry run python src/python/translate_with_gemma.py [input.txt] [-o output.txt]
or
MLFLOW_TRACKING_URI=http://localhost:5001 poetry run python src/python/translate_with_gemma.py [input.txt] [-o output.txt]
Important:
- Better input audio = better cloning
- Noise gets cloned too
- You can manually tweak pronunciation
For example:
step 1
https://preview.redd.it/ymtkgogawupg1.png?width=780&format=png&auto=webp&s=f00c7fae927d8d25d4f61bf24e18b34f8ac001a4
step 2
https://preview.redd.it/0ttbq3cbwupg1.png?width=780&format=png&auto=webp&s=bf3150fcbddaa51421fdbf4cd56fc46663ed9e1b
step 3
https://preview.redd.it/m3dc5w3cwupg1.png?width=780&format=png&auto=webp&s=e62848f1be86cf9e081ecd7252fa79a1c55e9eac
and the difference
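A pronunciation dictionary like the one in the screenshots can be implemented as a whole-word substitution pass before TTS. The words and accented forms below are made-up examples, not the tool's actual dictionary:

```python
import re

# Hypothetical stress dictionary: map a word to its accented form
# (for Russian TTS, a combining acute on the stressed vowel is a common convention).
PRONUNCIATION = {
    "Ollama": "Олла́ма",
    "commit": "комми́т",
}

def apply_pronunciation(text: str, mapping: dict[str, str]) -> str:
    """Replace whole words only, so substrings inside longer words stay untouched."""
    for word, fixed in mapping.items():
        text = re.sub(rf"\b{re.escape(word)}\b", fixed, text)
    return text
```

The word-boundary anchors matter: without them, "commit" would also rewrite the inside of "committed".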
The main goal of the prompts is to reduce the amount of repetition and get rid of constructions that aren't used in normal spoken YouTube narration.
Some Observations
- Large models (27B) are slow — smaller ones are more practical
- Batch size matters — too large → hallucinations mid-generation
- Sometimes reloading the model is actually better than long runs
- On macOS:
- metal-attention exists but is messy; I also tried to adapt aule-attention, but it doesn't work well with Qwen3-TTS, so I can share the code if it's needed
- Voice cloning:
- works best with clean speech
- accent quirks get amplified 😄 (I will attach the link in the comments)
So, two minutes before it was done (all my dotfiles are here, of course: http://github.com/the-homeless-god/dotfiles)
The first result is done: I used my voice from a recent video to voice over a Fireship video in Russian.
And of course I prepared the reference text well.
Logseq knowledge base
Later I finished the local Ollama stuff for the Python app, GitHub Actions, and other build stuff.
A lot of snakes & pythons
And to finish, just debugging the pipes.
https://preview.redd.it/x20w17uzwupg1.png?width=780&format=png&auto=webp&s=ce066e016ee9208812220ce31d0beff8eaf38a04
Some issues happened with the Linux image, but I think others can easily contribute via PRs.
CI/CD publishes artifacts on tags
https://preview.redd.it/t9ak5zy4xupg1.png?width=780&format=png&auto=webp&s=9f3942a8165485f2f03af5273d175e31a96eff66
I'm not sure how to handle verification of the binaries; maybe publish them to the App Store? WDYT?
https://preview.redd.it/vq16kbn7xupg1.png?width=481&format=png&auto=webp&s=3875b4df36bb0fe05e5d98e5e612b896aa163b5a
Desktop Features
Local execution from the binary works well for translation,
but I had to run a file inside Package Contents to be able to call Qwen3-TTS; it just attaches to the local Ollama instance.
- Translate + voice OR voice-only mode
- Language selection
- Batch & token control
- Model selection (translation + TTS)
- Reference audio file picker
- Logs
- Prompt editor
- Pronunciation dictionary
- Output folder control
- Multi-window output view
https://preview.redd.it/n9sjen6exupg1.png?width=780&format=png&auto=webp&s=381dae851703775f67330ecf1cd48d02cb8f2d1d
Main goal:
Make re-voicing videos fast and repeatable
Secondary goal:
Eventually plug this into:
- OpenClaw
- n8n pipelines
- automated content workflows
Future Ideas
- Auto-dubbing videos via pipelines
- AI agents that handle calls / bookings
- Re-voicing anime (yes, seriously 😄)
- Digital avatars
Notes
- It’s a bit messy (yes, it’s Python)
- Built fast, not “production-perfect”
- Open-source — PRs welcome
- Use it however you want (commercial too)
https://preview.redd.it/9kywz29fxupg1.png?width=780&format=png&auto=webp&s=c4314bb75b85fc2b4491662da8792edd4f3c7ffc
If you’ve got ideas for experiments, drop them in the comments. Thanks if you read to the end, and let me know if it's OK to post something like this next time.
GitHub: https://github.com/the-homeless-god/voicer