it looks like it will be soon 💎💎💎💎

Reddit r/LocalLLaMA / 4/2/2026


Key Points

  • A new llama.cpp pull request adds support that appears to target Hugging Faceโ€™s Gemma 4 multimodal model updates.
  • Gemma 4 is described as a multimodal model with pretrained and instruction-tuned variants in 1B, 13B, and 27B parameter sizes.
  • The article highlights key architectural changes for vision: a vision processor that generates outputs within a fixed token budget and a spatial 2D RoPE to encode information across height and width.
  • The discussion notes this PR likely applies to dense models only, implying separate work would be needed for Mixture-of-Experts (MoE) variants.
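The "fixed token budget" idea from the key points can be illustrated with a small sketch. This is a hypothetical illustration, not the actual Gemma 4 or llama.cpp preprocessing code: the image is scaled down (preserving aspect ratio, snapping to the patch grid) so the resulting patch count never exceeds the budget.

```python
import math

# Hedged sketch (function and parameter names are hypothetical, not from
# the PR): choose an output size so the number of vision patches, i.e.
# vision tokens, stays within a fixed budget.
def fit_to_token_budget(width: int, height: int,
                        patch: int = 16, budget: int = 256):
    """Scale the image down (never up) so that
    (w // patch) * (h // patch) <= budget, keeping the aspect ratio."""
    def tokens(w, h):
        return (w // patch) * (h // patch)

    if tokens(width, height) <= budget:
        return width, height

    # Shrink both sides by the same factor so the patch-grid area
    # lands near the budget, then snap down to multiples of `patch`.
    scale = math.sqrt(budget / tokens(width, height))
    w = max(patch, int(width * scale) // patch * patch)
    h = max(patch, int(height * scale) // patch * patch)

    # Guard against rounding overshoot.
    while tokens(w, h) > budget and (w > patch or h > patch):
        if w >= h and w > patch:
            w -= patch
        elif h > patch:
            h -= patch
    return w, h
```

With this scheme a 1024x768 input and a 256-token budget would come out around 288x208 (18x13 = 234 patches), while any image already under budget passes through unchanged.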

https://github.com/ggml-org/llama.cpp/pull/21309 (thanks rerri)

from HF https://github.com/huggingface/transformers/pull/45192

[Gemma 4](INSET_PAPER_LINK) is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameter sizes. The architecture is mostly the same as in previous Gemma versions. The key differences are a vision processor that can output images within a fixed token budget, and a spatial 2D RoPE that encodes vision-specific information across the height and width axes.
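A spatial 2D RoPE of the kind described above can be sketched roughly like this. This is a generic illustration of the idea, not the Gemma 4 or transformers implementation: half of a head's rotary channels are rotated by the patch's row (height) position and the other half by its column (width) position, instead of a single 1D sequence index.

```python
import math

# Hedged sketch of a spatial 2D RoPE (hypothetical, not the actual
# Gemma 4 / llama.cpp code). Standard 1D RoPE rotates channel pairs by
# angles derived from one position; here the head dimension is split so
# row and column positions each drive half of the channel pairs.

def rope_angles(pos: int, dim: int, base: float = 10000.0):
    """1D RoPE angles for one position over `dim` channels (dim even)."""
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

def spatial_2d_rope(row: int, col: int, head_dim: int,
                    base: float = 10000.0):
    """Angles for a patch at (row, col): row position covers the first
    half of the head dimension, column position the second half."""
    half = head_dim // 2
    return rope_angles(row, half, base) + rope_angles(col, half, base)

def apply_rope(vec, angles):
    """Rotate consecutive channel pairs (x0, x1) of `vec` by each angle."""
    out = []
    for a, i in zip(angles, range(0, len(vec), 2)):
        c, s = math.cos(a), math.sin(a)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out
```

The design point is that two patches in the same row share the height component of their rotation and two patches in the same column share the width component, so attention scores can become sensitive to 2D offsets rather than to a flattened 1D distance.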

this PR probably only applies to the dense models, so the MoE variants would need separate support

submitted by /u/jacek2023