https://preview.redd.it/h1thbwyh0vpg1.png?width=780&format=png&auto=webp&s=ed003920197dad29320430777da1581a1d628f01
Hi everyone,
Sorry if this format isn't a good fit for Reddit; blogging is just my style. Maybe I should have posted it on another portal, I don't know.
So let's start with the backstory:
About two years ago I translated 19,784 World of Warcraft quests into Russian using voice cloning with local models. Recently I revived my YouTube channel and started posting stream highlights about programming. While experimenting, I re-voiced a Fireship video about OpenClaw, and that's where the idea evolved into something bigger: digital avatars and voice replacements.
So I started thinking…
Yes, I can watch videos in English just fine. But I still prefer localized voiceovers (like Vert Dider over original Veritasium). And then I thought — why not do this myself?
Right, because I’m too lazy to do it manually 😄
So instead, I automated a process that should take ~15 minutes… but I spent hours building tooling for it. Classic programmer logic.
This post is a translation of my post on Habr, a Russian alternative to Reddit (the link to the original post); sorry for my English anyway.
Final Result
Voicer (open-source): A tool that automates translation + voiceover using cloned voices.
I originally built it for myself, but wrapped it into a desktop app so others don’t have to deal with CLI if they don’t want to.
It runs locally via Ollama (or you can adapt it to LM Studio or anything else).
What It Does
- Desktop app (yeah, Python 😄)
- Integrated with Ollama
- Uses one model (I used translategemma:27b) to:
  - clean raw subtitles
  - adapt the text
  - translate into the target language
  - clean/adapt again for narration
- Uses another model (Qwen3-TTS) to:
  - generate speech from the translated text
  - mimic a reference voice
- Batch processing (by sentences)
- Custom pronunciation dictionary (stress control)
- Optional CLI (for automation / agents / pipelines)
How It Works (Simplified Pipeline)
- Extract subtitles
Download captions from YouTube (e.g. via downsub)
https://preview.redd.it/0jpjuvrivupg1.png?width=767&format=png&auto=webp&s=be5fcae7258c148a94f2e258a19531575be23a43
- Clean the text
https://preview.redd.it/pc8p8nmjvupg1.png?width=780&format=png&auto=webp&s=3729a24b1428a7666301033d9bc81c8007624002
Subtitles are messy — duplicates, broken phrasing, etc.
You can:
- clean manually
- use GPT
- or (like me) use local models
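If you'd rather not involve a model at all, the basic cleanup (dropping cue numbers, timestamps, and the duplicate lines that rolling captions produce) can be sketched in a few lines. This is a simplified illustration, not the tool's actual code:

```python
import re

def clean_srt(srt_text: str) -> str:
    """Strip SRT cue numbers/timestamps and drop consecutive duplicate lines."""
    timestamp = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> ")
    lines, prev = [], None
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip cue indices, timestamps, and blank separators
        if not line or line.isdigit() or timestamp.match(line):
            continue
        if line != prev:  # rolling captions repeat the previous line
            lines.append(line)
        prev = line
    return " ".join(lines)
```

Real subtitles are messier than this (partial-line repeats, mid-word breaks), which is exactly why handing the heavy lifting to a model is attractive.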
- 3-Step Translation Pipeline
I used a 3-stage prompting approach:
Clean broken English
You are a text editor working with YouTube transcripts. Clean the following transcript while preserving the original meaning.
Rules:
- Merge broken sentences caused by subtitle line breaks
- Remove duplicated words or fragments
- Fix punctuation
- Keep the original wording as much as possible
- Do not summarize or shorten the text
- Do not add commentary
Output only the cleaned English transcript.
Transcript:
Translate carefully
You are an expert translator and technical writer specializing in programming and software engineering content. Your task is to translate the following English transcript into natural Russian suitable for a YouTube tech video narration.
Important: This is a spoken video transcript.
Guidelines:
1. Preserve the meaning and technical information.
2. Do NOT translate literally.
3. Rewrite sentences so they sound natural in Russian.
4. Use clear, natural Russian with a slightly conversational tone.
5. Prefer shorter sentences suitable for narration.
6. Keep product names, libraries, commands, companies, and technologies in English.
7. Adapt jokes if necessary so they sound natural in Russian.
8. If a direct translation sounds unnatural, rewrite the sentence while preserving the meaning.
9. Do not add commentary or explanations.
Formatting rules:
- Output only the Russian translation
- Keep paragraph structure
- Make the result suitable for voice narration
Text to translate:
Adapt text for natural speech
You are editing a Russian translation of a programming YouTube video. Rewrite the text so it sounds more natural and fluid for voice narration.
Rules:
- Do not change the meaning
- Improve readability and flow
- Prefer shorter spoken sentences
- Make it sound like a developer explaining technology in a YouTube video
- Remove awkward phrasing
- Keep technical names in English
- Do not add explanations or commentary
Output only the final Russian narration script.
Text:
Prompts are simple, nothing fancy — just works.
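The three stages chain naturally, each stage's output feeding the next prompt. A minimal sketch, where `generate` is a stand-in for whatever LLM backend you use (Ollama, an OpenAI-compatible endpoint, etc.) and the templates are abbreviated:

```python
from typing import Callable

# Abbreviated versions of the three prompts above; only the {text} slot matters here.
STAGES = [
    "You are a text editor working with YouTube transcripts. Clean the following "
    "transcript. Output only the cleaned English transcript.\nTranscript:\n{text}",
    "You are an expert translator. Translate the following transcript into natural "
    "Russian. Output only the Russian translation.\nText to translate:\n{text}",
    "You are editing a Russian translation. Rewrite it for natural narration. "
    "Output only the final Russian narration script.\nText:\n{text}",
]

def run_pipeline(text: str, generate: Callable[[str], str]) -> str:
    """Feed each stage's output into the next stage's prompt."""
    for template in STAGES:
        text = generate(template.format(text=text))
    return text
```

Keeping `generate` as a plain callable also makes the chain easy to test without a model running.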
- Voice Generation
Of course, I needed an option to capture metrics, but it generally works without MLflow too. MLflow here is a tool that traces OpenAI-compatible calls so you can track token economics and so on.
- Uses translategemma (I found advice on Reddit recommending it)
- Requires:
- reference audio (voice sample)
- matching reference text
- Output: cloned voice speaking translated text
The CLI signature is the following:
poetry run python src/python/translate_with_gemma.py [input.txt] [-o output.txt]
or
MLFLOW_TRACKING_URI=http://localhost:5001 poetry run python src/python/translate_with_gemma.py [input.txt] [-o output.txt]
Important:
- Better input audio = better cloning
- Noise gets cloned too
- You can manually tweak pronunciation
For example:
step 1
https://preview.redd.it/ymtkgogawupg1.png?width=780&format=png&auto=webp&s=f00c7fae927d8d25d4f61bf24e18b34f8ac001a4
step 2
https://preview.redd.it/0ttbq3cbwupg1.png?width=780&format=png&auto=webp&s=bf3150fcbddaa51421fdbf4cd56fc46663ed9e1b
step 3
https://preview.redd.it/m3dc5w3cwupg1.png?width=780&format=png&auto=webp&s=e62848f1be86cf9e081ecd7252fa79a1c55e9eac
and the difference
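A pronunciation dictionary like the one in the screenshots can be implemented as a whole-word substitution pass before TTS. The words and accented forms below are made-up examples, not the tool's actual dictionary:

```python
import re

# Hypothetical stress dictionary: map a word to its accented form
# (for Russian TTS, a combining acute on the stressed vowel is a common convention).
PRONUNCIATION = {
    "Ollama": "Олла́ма",
    "commit": "комми́т",
}

def apply_pronunciation(text: str, mapping: dict[str, str]) -> str:
    """Replace whole words only, so substrings inside longer words stay untouched."""
    for word, fixed in mapping.items():
        text = re.sub(rf"\b{re.escape(word)}\b", fixed, text)
    return text
```

The word-boundary anchors matter: without them, "commit" would also rewrite the inside of "committed".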
The main goal of the prompts is to reduce the amount of repetition and get rid of constructions that aren't used in normal spoken YouTube narration.
Some Observations
- Large models (27B) are slow — smaller ones are more practical
- Batch size matters — too large → hallucinations mid-generation
- Sometimes reloading the model is actually better than long runs
- On macOS:
- metal-attention exists but is messy; I also tried to adapt aule-attention, but it doesn't work well with Qwen3-TTS, so I can share the code if it's needed
- Voice cloning:
- works best with clean speech
- accent quirks get amplified 😄 (I will attach the link in the comments)
So, two minutes before it was done (all my dotfiles are here, of course: http://github.com/the-homeless-god/dotfiles)
The first result is done: I used my voice from a recent video to voice over a Fireship video in Russian.
And of course I prepared the reference text well.
Logseq knowledge base
Later I finished the local Ollama stuff for the Python app, GitHub Actions, and other build stuff.
A lot of snakes & pythons
And to finish, just debugging the pipes.
https://preview.redd.it/x20w17uzwupg1.png?width=780&format=png&auto=webp&s=ce066e016ee9208812220ce31d0beff8eaf38a04
Some issues happened with the Linux image, but I think others can easily contribute via PRs.
CI/CD publishes artifacts on tags
https://preview.redd.it/t9ak5zy4xupg1.png?width=780&format=png&auto=webp&s=9f3942a8165485f2f03af5273d175e31a96eff66
I'm not sure how to handle verification of the binaries; maybe publish them to the App Store? WDYT?
https://preview.redd.it/vq16kbn7xupg1.png?width=481&format=png&auto=webp&s=3875b4df36bb0fe05e5d98e5e612b896aa163b5a
Desktop Features
Local execution from the binary works well for translation,
but I had to run a file inside Package Contents to be able to call Qwen3-TTS; it just attaches to the local Ollama instance.
- Translate + voice OR voice-only mode
- Language selection
- Batch & token control
- Model selection (translation + TTS)
- Reference audio file picker
- Logs
- Prompt editor
- Pronunciation dictionary
- Output folder control
- Multi-window output view
https://preview.redd.it/n9sjen6exupg1.png?width=780&format=png&auto=webp&s=381dae851703775f67330ecf1cd48d02cb8f2d1d
Main goal:
Make re-voicing videos fast and repeatable
Secondary goal:
Eventually plug this into:
- OpenClaw
- n8n pipelines
- automated content workflows
Future Ideas
- Auto-dubbing videos via pipelines
- AI agents that handle calls / bookings
- Re-voicing anime (yes, seriously 😄)
- Digital avatars
Notes
- It’s a bit messy (yes, it’s Python)
- Built fast, not “production-perfect”
- Open-source — PRs welcome
- Use it however you want (commercial too)
https://preview.redd.it/9kywz29fxupg1.png?width=780&format=png&auto=webp&s=c4314bb75b85fc2b4491662da8792edd4f3c7ffc
If you’ve got ideas for experiments, drop them in the comments. Thanks if you read to the end, and let me know if it's OK to post something like this next time.
GitHub: https://github.com/the-homeless-god/voicer