Do you use LLMs with TTS and speech recognition?

Reddit r/LocalLLaMA / 4/13/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The post asks whether people interact with LLMs using speech recognition for input and TTS for audio output.
  • The author describes their local setup using Fast-Kokoro (TTS), Koboldcpp with a Whisper model (speech recognition), and a Gemma 4 small E4B model via SillyTavern.
  • They report that the system feels close to real time on an RTX 4060 Ti (16 GB VRAM) with 32 GB RAM, making voice conversation practical.
  • The author seeks community feedback on how common this voice-driven LLM workflow is and whether others use it routinely or rarely.

As the title says, do you talk to your LLM using speech recognition and listen back to its answers with TTS models?

Last night I didn't sleep much, so I sat at my computer and installed Fast-Kokoro for TTS and configured Koboldcpp with a Whisper model. So far it has been a great experience with SillyTavern and a Gemma 4 small E4B model.

I have an RTX 4060 Ti with 16 GB VRAM and 32 GB of RAM, and with this setup (SillyTavern + Koboldcpp + Whisper + Gemma 4 E4B + Fast-Kokoro) it is almost real time, so it is realistic to use for talking with voice.
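For anyone curious what the glue between these pieces looks like, here is a minimal sketch of one turn of such a voice loop in Python. It assumes Koboldcpp's OpenAI-compatible chat endpoint on its default port; the TTS server URL and the exact response shape are assumptions, not something taken from the post (the author drives everything through SillyTavern rather than scripts like this).

```python
# Minimal sketch of one turn of a local voice-chat loop.
# Assumptions: Koboldcpp serving an OpenAI-compatible API on its default
# port (5001), and a hypothetical local TTS server for Fast-Kokoro.
import json
import urllib.request

KOBOLD_URL = "http://localhost:5001"   # Koboldcpp default port
TTS_URL = "http://localhost:8880"      # hypothetical Fast-Kokoro endpoint

def build_chat_payload(history, user_text, max_tokens=200):
    """Build an OpenAI-style chat request from prior turns plus new input."""
    messages = history + [{"role": "user", "content": user_text}]
    return {"messages": messages, "max_tokens": max_tokens}

def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def chat_turn(history, user_text):
    """One turn: send the transcribed speech to the LLM, return reply text."""
    payload = build_chat_payload(history, user_text)
    reply = post_json(KOBOLD_URL + "/v1/chat/completions", payload)
    return reply["choices"][0]["message"]["content"]
```

In a full loop, Whisper transcription would produce `user_text` and the returned reply would be sent to the TTS server; keeping each turn to a single short request per component is what makes the near-real-time feel on a 16 GB card plausible.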

Since this is quite new to me (I had previously only used TTS a long time ago, for testing), I was wondering how others here are doing. Do you talk to your LLMs, or is it a rarer use case?

submitted by /u/film_man_84