anyone got audio working in small gemma-4 models ???

Reddit r/LocalLLaMA / 4/8/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A Reddit user reports difficulties enabling an audio pipeline (VAD → LLM → TTS) with small Gemma-4 models, finding that audio “refuses to work” across multiple setups.
  • They tried several llama.cpp builds and Unsloth Studio without success, suggesting current community stacks may not support the needed audio flow reliably for these models.
  • The only working option they found is Google’s LiteRT LM, but it reportedly forces CPU-only inference when audio is involved and significantly hurts performance.
  • They note that a GPU implementation appears to be pending on GitHub and ask the community for workarounds or alternative stacks that actually function.
  • The post highlights a practical gap in local/offline audio-capable LLM workflows for small Gemma-4 deployments, especially regarding GPU acceleration and end-to-end audio handling.

Trying pipeline

VAD speech chunk > LLM > TTS

skipping ASR part completely
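For context, the intended flow can be sketched end to end. This is a minimal stand-in, not any of the stacks mentioned in the post: a crude energy-threshold gate substitutes for a real VAD, and `llm_audio_to_text` / `tts` are hypothetical callables representing the audio-capable LLM and the TTS stage.

```python
import math
import struct

SAMPLE_RATE = 16000          # 16 kHz mono, 16-bit PCM assumed
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

def frame_rms(frame: bytes) -> float:
    """RMS energy of one frame of 16-bit little-endian PCM."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def vad_chunks(pcm: bytes, threshold: float = 500.0):
    """Yield contiguous speech chunks; an energy threshold stands in for a real VAD."""
    step = FRAME_SAMPLES * 2  # 2 bytes per sample
    chunk = b""
    for i in range(0, len(pcm) - step + 1, step):
        frame = pcm[i:i + step]
        if frame_rms(frame) >= threshold:
            chunk += frame        # still inside speech
        elif chunk:
            yield chunk           # silence after speech: flush the chunk
            chunk = b""
    if chunk:
        yield chunk

def pipeline(pcm: bytes, llm_audio_to_text, tts):
    """VAD speech chunk -> LLM -> TTS, skipping ASR: audio chunks go
    straight into a (hypothetical) audio-input LLM, replies into TTS."""
    for chunk in vad_chunks(pcm):
        reply = llm_audio_to_text(chunk)
        yield tts(reply)
```

The point of the sketch is the shape of the loop: no ASR step, so the LLM itself must accept raw audio chunks, which is exactly the capability the post reports failing in llama.cpp builds.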

but audio just refuses to work

tried multiple llama.cpp builds and Unsloth Studio
no luck so far

only thing that works is LiteRT LM by Google
but it forces CPU-only inference when audio is involved
and it kills performance

saw on GitHub that the GPU implementation is still pending

any workaround or different stack that actually works ???

submitted by /u/KokaOP