it looks like it will be soon 💎💎💎💎

Reddit r/LocalLLaMA / 4/2/2026


Key Points

  • A new llama.cpp pull request adds support that appears to target Hugging Faceโ€™s Gemma 4 multimodal model updates.
  • Gemma 4 is described as a multimodal model with pretrained and instruction-tuned variants in 1B, 13B, and 27B parameter sizes.
  • The article highlights key architectural changes for vision: a vision processor that generates outputs within a fixed token budget and a spatial 2D RoPE to encode information across height and width.
  • The discussion notes this PR likely applies to dense models only, implying separate work would be needed for Mixture-of-Experts (MoE) variants.
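The "fixed token budget" idea from the key points can be illustrated with a small sketch. This is a hypothetical illustration, not the actual Gemma 4 or llama.cpp preprocessing code: the image is scaled down (preserving aspect ratio, snapping to the patch grid) so the resulting patch count never exceeds the budget.

```python
import math

# Hedged sketch (function and parameter names are hypothetical, not from
# the PR): choose an output size so the number of vision patches, i.e.
# vision tokens, stays within a fixed budget.
def fit_to_token_budget(width: int, height: int,
                        patch: int = 16, budget: int = 256):
    """Scale the image down (never up) so that
    (w // patch) * (h // patch) <= budget, keeping the aspect ratio."""
    def tokens(w, h):
        return (w // patch) * (h // patch)

    if tokens(width, height) <= budget:
        return width, height

    # Shrink both sides by the same factor so the patch-grid area
    # lands near the budget, then snap down to multiples of `patch`.
    scale = math.sqrt(budget / tokens(width, height))
    w = max(patch, int(width * scale) // patch * patch)
    h = max(patch, int(height * scale) // patch * patch)

    # Guard against rounding overshoot.
    while tokens(w, h) > budget and (w > patch or h > patch):
        if w >= h and w > patch:
            w -= patch
        elif h > patch:
            h -= patch
    return w, h
```

With this scheme a 1024x768 input and a 256-token budget would come out around 288x208 (18x13 = 234 patches), while any image already under budget passes through unchanged.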

https://github.com/ggml-org/llama.cpp/pull/21309 (thanks rerri)

from HF https://github.com/huggingface/transformers/pull/45192

[Gemma 4](INSET_PAPER_LINK) is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameter sizes. The architecture is mostly the same as in previous Gemma versions. The key differences are a vision processor that can output images within a fixed token budget, and a spatial 2D RoPE that encodes vision-specific information across the height and width axes.
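A spatial 2D RoPE of the kind described above can be sketched roughly like this. This is a generic illustration of the idea, not the Gemma 4 or transformers implementation: half of a head's rotary channels are rotated by the patch's row (height) position and the other half by its column (width) position, instead of a single 1D sequence index.

```python
import math

# Hedged sketch of a spatial 2D RoPE (hypothetical, not the actual
# Gemma 4 / llama.cpp code). Standard 1D RoPE rotates channel pairs by
# angles derived from one position; here the head dimension is split so
# row and column positions each drive half of the channel pairs.

def rope_angles(pos: int, dim: int, base: float = 10000.0):
    """1D RoPE angles for one position over `dim` channels (dim even)."""
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

def spatial_2d_rope(row: int, col: int, head_dim: int,
                    base: float = 10000.0):
    """Angles for a patch at (row, col): row position covers the first
    half of the head dimension, column position the second half."""
    half = head_dim // 2
    return rope_angles(row, half, base) + rope_angles(col, half, base)

def apply_rope(vec, angles):
    """Rotate consecutive channel pairs (x0, x1) of `vec` by each angle."""
    out = []
    for a, i in zip(angles, range(0, len(vec), 2)):
        c, s = math.cos(a), math.sin(a)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out
```

The design point is that two patches in the same row share the height component of their rotation and two patches in the same column share the width component, so attention scores can become sensitive to 2D offsets rather than to a flattened 1D distance.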

this PR probably only applies to the dense models, so the MoE variants would need separate support

submitted by /u/jacek2023