You can do a lot with an old mobile GPU these days

Reddit r/LocalLLaMA / 3/26/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A developer demonstrates a fully local conversational LLM chatbot (speech-to-text + LLM + text-to-speech) running on a single RTX 3080 Mobile GPU with 16GB VRAM, with minimal system RAM usage and no Python dependencies.
  • The setup combines Qwen 3.5 9B for dialogue generation (customized talk-llama.cpp with configurable KV cache quantization), Whisper-small for speech-to-text, and Orpheus-3B-finetuned for emotive text-to-speech.
  • Custom C++ tooling is used to efficiently convert TTS tokens into audio via an optimized SNAC decoder through ONNX Runtime, enabling chunked audio generation and playback directly from RAM.
  • The demo targets maximum conversational realism using extensive A/B-tested system prompts and tuned generation parameters, achieving reasonable latency despite using an older 2021-era mobile GPU.
  • Overall, the post suggests that modern local AI voice assistants are increasingly feasible on older consumer GPU hardware through quantization, C++ implementations, and tight runtime integration.

Something I built. A conversational LLM chatbot, using speech-to-text and text-to-speech interfaces. The design goal was maximum conversational realism and engagement in a resource-constrained environment.

In this demo, everything runs on a single RTX 3080 Mobile GPU with 16 GB VRAM total. Minimal system RAM usage and no Python dependencies. All components are built in C++ for speed.

Components include:

1) Qwen3.5-9B UD-Q6_K_XL (GGUF) - LLM running on a (slightly) customized talk-llama.cpp example from GGML.org's whisper.cpp. Customizations include the ability to set KV cache quantization levels, as well as additional Qwen3.5 generation parameters (repeat-penalty, presence-penalty) to optimize text generation. Context is 49,152 tokens, enough for a couple of hours of conversational turns.
2) Whisper-small (GGUF) model for accurate STT, running on talk-llama.cpp.
3) Orpheus-3B-ft UD-Q4_K_XL (GGUF) - A leading local text-to-speech model with the popular "Tara" voice, running on llama-server from GGML.org's llama.cpp. Supports emotive tags, e.g. laugh, chuckle, sigh.
4) Custom-written "orpheus-speak" C++ app that rapidly converts the speech tokens generated by the Orpheus TTS into audio using an optimized snac24_dynamic_fp16 (community-sourced) decoder running on ONNX Runtime. The decoder stays warm between utterances, and WAV audio data is written directly to and played from RAM in 3-sentence chunks, allowing accurate and (relatively) rapid audio generation across long text blocks.
5) An extensively A/B tested system prompt allowing for natural-sounding, engaging conversations, compiled into talk-llama.cpp.
6) A launcher shell script optimizing context and generation parameters across all neural nets (LLM, STT, TTS, decoder) running on the GPU.

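The "WAV data played directly from RAM" trick in component 4 amounts to assembling a complete WAV file (header plus PCM samples) in a memory buffer. A minimal sketch, assuming 24 kHz mono 16-bit output (SNAC commonly decodes at 24 kHz) and a little-endian host; the actual orpheus-speak code isn't shown in the post, and the platform-specific playback call is omitted:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Build a complete WAV file (44-byte header + PCM samples) in a RAM
// buffer, so audio can be handed to a player without touching disk.
// Assumes a little-endian host, since WAV fields are little-endian.
std::vector<uint8_t> wav_in_ram(const std::vector<int16_t>& pcm,
                                uint32_t sample_rate = 24000) {
    const uint32_t data_bytes = (uint32_t)(pcm.size() * sizeof(int16_t));
    const uint16_t channels = 1, bits = 16;
    const uint32_t byte_rate = sample_rate * channels * bits / 8;
    const uint16_t block_align = channels * bits / 8;

    std::vector<uint8_t> buf(44 + data_bytes);
    uint8_t* p = buf.data();
    auto put32 = [&](uint32_t v) { memcpy(p, &v, 4); p += 4; };
    auto put16 = [&](uint16_t v) { memcpy(p, &v, 2); p += 2; };
    auto tag   = [&](const char* t) { memcpy(p, t, 4); p += 4; };

    tag("RIFF"); put32(36 + data_bytes); tag("WAVE");
    tag("fmt "); put32(16); put16(1 /* PCM */); put16(channels);
    put32(sample_rate); put32(byte_rate); put16(block_align); put16(bits);
    tag("data"); put32(data_bytes);
    memcpy(p, pcm.data(), data_bytes);
    return buf;
}
```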
Latency between user voice input and system voice output is still somewhat high when the system generates longer blocks of text, but that's pretty good for a GPU released in 2021 (!).
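The 3-sentence chunking that keeps playback responsive on long replies can be sketched roughly as below — a naive splitter on ./!/? terminators; the real orpheus-speak logic isn't shown in the post and presumably handles abbreviations and punctuation edge cases:

```cpp
#include <string>
#include <vector>

// Split text into chunks of up to `n` sentences each, so TTS decoding
// and playback of the first chunk can start before the full reply has
// been synthesized -- masking latency on long text blocks.
std::vector<std::string> sentence_chunks(const std::string& text, size_t n = 3) {
    std::vector<std::string> chunks;
    std::string cur;
    size_t sentences = 0;
    for (char c : text) {
        cur += c;
        if (c == '.' || c == '!' || c == '?') {
            if (++sentences == n) {
                chunks.push_back(cur);
                cur.clear();
                sentences = 0;
            }
        }
    }
    if (!cur.empty()) chunks.push_back(cur);  // trailing partial chunk
    return chunks;
}
```

With this scheme, time-to-first-audio depends only on the first three sentences rather than the whole reply.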

submitted by /u/Responsible_Fig_1271