Something I built: a conversational LLM chatbot using speech-to-text and text-to-speech interfaces. The design goal was maximum conversational realism and engagement in a resource-constrained environment. In this demo, everything runs on a single RTX 3080 Mobile GPU with 16 GB of VRAM, with minimal system RAM usage and no Python dependencies; all components are built in C++ for speed.

Components include: 1) Qwen3.5-9B UD-Q6_K_XL (GGUF), the LLM, running on a (slightly) customized build of the talk-llama example from GGML.org's whisper.cpp. Customizations include the ability to set KV-cache quantization levels, as well as additional Qwen3.5 generation parameters (repeat-penalty, presence-penalty) to tune text generation. Context is 49,152 tokens, enough for a couple of hours of conversational turns. Latency between user voice input and system voice output is still somewhat high when the system generates longer blocks of text, but that's still pretty good for a GPU released in 2021 (!).
You can do a lot with an old mobile GPU these days
Reddit r/LocalLLaMA / 3/26/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- A developer demonstrates a fully local conversational LLM chatbot (speech-to-text + LLM + text-to-speech) running on a single RTX 3080 Mobile GPU with 16GB VRAM, with minimal system RAM usage and no Python dependencies.
- The setup combines Qwen 3.5 9B for dialogue generation (customized talk-llama.cpp with configurable KV cache quantization), Whisper-small for speech-to-text, and Orpheus-3B-finetuned for emotive text-to-speech.
- Custom C++ tooling is used to efficiently convert TTS tokens into audio via an optimized SNAC decoder through ONNX Runtime, enabling chunked audio generation and playback directly from RAM.
- The demo targets maximum conversational realism using extensive A/B-tested system prompts and tuned generation parameters, achieving reasonable latency despite using an older 2021-era mobile GPU.
- Overall, the post suggests that modern local AI voice assistants are increasingly feasible on older consumer GPU hardware through quantization, C++ implementations, and tight runtime integration.
Related Articles
Speaking of Voxtral · Research
Voxtral TTS: A frontier, open-weights text-to-speech model that’s fast, instantly adaptable, and produces lifelike speech for voice agents.
Mistral AI Blog
Why I Switched from Cloud AI to a Dedicated AI Box (And Why You Should Too)
Dev.to
How to Use MiMo V2 API for Free in 2026: Complete Guide
Dev.to
The Agent Memory Problem Nobody Solves: A Practical Architecture for Persistent Context
Dev.to
Why We Ditched 6 APIs and Built One MCP Server for Our Entire Ecommerce Stack
Dev.to