The Gemma 3n E4B and E2B models have built-in multimodal capabilities. However, as far as I can tell, llama.cpp does not yet properly support vision and audio inputs for these models (especially audio).
I was able to extract the audio encoder from the official model repository on Hugging Face and vibe-code a bridge that passes the audio embeddings directly to the model, and it actually works. This setup uses Unsloth's GGUF at Q4 plus the audio encoder at full precision (PyTorch), and takes about 5.5-6 GB of VRAM.
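For the curious, the bridge boils down to something like the sketch below. This is simplified and the names are assumptions rather than verified API: `Gemma3nForConditionalGeneration`, `get_audio_features`, and the `input_features*` keys are how I understand recent transformers releases expose Gemma 3n (check your installed version), the GGUF filename is a placeholder, and the embedding-input path is the documented llama.cpp `llama_batch` C API reached through llama-cpp-python's low-level bindings.

```python
# Rough sketch of the encoder-to-GGUF bridge, not my exact code.
import torch
import llama_cpp
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

model_id = "google/gemma-3n-E4B-it"
processor = AutoProcessor.from_pretrained(model_id)
hf_model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float32
).eval()  # only the audio tower is needed; the LM weights could be dropped

# 16 kHz mono audio in -> embeddings in the LM's hidden space out.
# Key names and the return type of get_audio_features are assumptions.
inputs = processor(text=" ", audio="clip.wav", return_tensors="pt")
with torch.no_grad():
    out = hf_model.get_audio_features(
        inputs["input_features"], inputs["input_features_mask"]
    )
audio_embeds = out[0] if isinstance(out, tuple) else out  # (1, n_frames, n_embd)

# llama.cpp side: llama_batch_init with a non-zero second argument allocates
# a batch that carries raw float embeddings instead of token ids, and
# llama_decode consumes it like any normal batch. The encoder's output dim
# must match the GGUF model's n_embd for this to work at all.
llm = llama_cpp.Llama(model_path="gemma-3n-E4B-it-Q4_K_M.gguf", n_ctx=4096)
n_tok, n_embd = audio_embeds.shape[1], audio_embeds.shape[2]
batch = llama_cpp.llama_batch_init(n_tok, n_embd, 1)
batch.n_tokens = n_tok
flat = audio_embeds[0].float().cpu().numpy().ravel()
for i, v in enumerate(flat):
    batch.embd[i] = float(v)
for i in range(n_tok):
    batch.pos[i] = i      # positions continue from any text prefix you decoded
    batch.n_seq_id[i] = 1
    batch.seq_id[i][0] = 0
    batch.logits[i] = 0
batch.logits[n_tok - 1] = 1  # request logits after the last audio frame
# How you reach the raw context pointer depends on the llama-cpp-python
# version (llm._ctx.ctx in recent releases):
# llama_cpp.llama_decode(llm._ctx.ctx, batch)
llama_cpp.llama_batch_free(batch)
```

The fiddly parts are exactly what a proper implementation would handle for you: position bookkeeping across the text/audio boundary, the audio placeholder tokens in the prompt template, and keeping the encoder's projection consistent with the quantized LM.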
The thing is, this all feels like a workaround for something that should be readily available and built in a more robust way, not vibe-coded by someone like me.
Maybe I am just unaware, but I am looking for a more complete, non-hacky way to use the model's multimodal capabilities under 6 GB of VRAM. If anyone can point me in the right direction, that would be awesome!
P.S.: I tried mistral.rs, but its multimodal path seems to need a lot of extra VRAM for some reason?