Gemma 4 MTP released

Reddit r/LocalLLaMA / 5/6/2026

📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • Google has released Gemma 4 Multi-Token Prediction (MTP) drafters: smaller, faster draft models that extend the Gemma 4 models for speculative decoding.
  • In an MTP speculative decoding pipeline, the draft model predicts several tokens ahead and the target model verifies them in parallel (see the sketches below).
  • The approach yields up to 2× faster decoding while guaranteeing output identical to standard generation from the target model.
  • The released Hugging Face checkpoints cover the Gemma 4 family (31B, 26B-A4B, E4B, E2B) and are positioned for low-latency and on-device deployments.
  • Per their model cards, these checkpoints are published specifically as MTP “drafters”, meaning they are meant to be plugged into a speculative decoding stack rather than used as standalone chat models.
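
For concreteness, here is a minimal sketch of how a drafter like this could be wired up with Hugging Face transformers, which exposes speculative decoding through the `assistant_model` argument of `generate()`. The target repo id (`google/gemma-4-31B-it`) is an assumption, since the post only links the drafter checkpoints, and whether these MTP drafters load and behave as ordinary assistant models in assisted generation is also an assumption:

```python
# Hedged sketch: assumes the MTP drafter can be loaded as an ordinary
# causal LM and passed to transformers' assisted generation. The target
# repo id below is a guess; the post only lists the "-assistant" drafters.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-4-31B-it"           # assumed target checkpoint
draft_id = "google/gemma-4-31B-it-assistant"  # drafter linked in the post

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

inputs = tok("Summarize speculative decoding in one sentence.",
             return_tensors="pt").to(target.device)

# assistant_model switches generate() into assisted (speculative) decoding:
# the draft proposes tokens, the target verifies them in one forward pass.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```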

Blog post:

https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/

MTP draft models:

https://huggingface.co/google/gemma-4-31B-it-assistant

https://huggingface.co/google/gemma-4-26B-A4B-it-assistant

https://huggingface.co/google/gemma-4-E4B-it-assistant

https://huggingface.co/google/gemma-4-E2B-it-assistant

From the model card:

This model card is for the Multi-Token Prediction (MTP) drafters for the Gemma 4 models. MTP is implemented by extending the base model with a smaller, faster draft model. When used in a Speculative Decoding pipeline, the draft model predicts several tokens ahead, which the target model then verifies in parallel. This results in significant decoding speedups (up to 2x) while guaranteeing the exact same quality as standard generation, making these checkpoints perfect for low-latency and on-device applications.
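
To make the "predict ahead, verify in parallel" loop concrete, here is a toy, self-contained sketch of the greedy variant of speculative decoding. The stand-in model functions are invented for illustration; in a real pipeline the verification step is a single batched forward pass of the target model, and the full algorithm uses a probabilistic accept/reject rule to match the target's sampling distribution rather than the simple token-agreement check shown here:

```python
# Toy illustration of the draft-and-verify loop (greedy variant).
# The stand-in "models" below are invented for illustration only.

K = 4  # number of tokens the draft speculates per step
VOCAB_SIZE = 10

def target_next(seq):
    # Stand-in for the big target model's greedy next token.
    return (sum(seq) + 1) % VOCAB_SIZE

def draft_next(seq):
    # Stand-in for the small drafter; agrees with the target most of
    # the time but not always, so some proposals get rejected.
    bump = 2 if len(seq) % 5 == 0 else 1
    return (sum(seq) + bump) % VOCAB_SIZE

def speculative_step(seq):
    # 1) The draft autoregressively proposes K tokens.
    proposal, work = [], list(seq)
    for _ in range(K):
        t = draft_next(work)
        proposal.append(t)
        work.append(t)

    # 2) The target "verifies in parallel": in a real system this is one
    #    forward pass; we keep the longest prefix where it agrees.
    out, work = list(seq), list(seq)
    for t in proposal:
        expected = target_next(work)
        if expected == t:           # target agrees with the draft
            out.append(t)
            work.append(t)
        else:                       # first mismatch: keep the target's
            out.append(expected)    # own token and stop this round
            return out
    # All K accepted; the same target pass also yields one bonus token.
    out.append(target_next(work))
    return out

seq = [3, 1, 4]
for _ in range(4):
    seq = speculative_step(seq)
print(seq)  # token-for-token identical to target-only greedy decoding
```

Because every appended token is either a draft token the target agrees with or the target's own choice, the result is identical to decoding with the target alone; the speedup comes from verifying several positions per target pass instead of one.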

submitted by /u/rerri