Turns out Gemma 4 had MTP (multi token prediction) all along

Reddit r/LocalLLaMA / 4/7/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A developer integrating Gemma 4 via the LiteRT API observed runtime errors related to “mtp weights being an incompatible tensor shape” on a Google Pixel 9 device.
  • Investigation suggested Gemma 4’s LiteRT package includes additional multi-token prediction (MTP) heads intended for speculative decoding and faster text generation.
  • The post claims a Google employee confirmed that Gemma 4 does have MTP, but it was “removed on purpose” to improve compatibility and broad usability across deployments.
  • The author speculates that reverse engineering LiteRT tensors and the compute graph might enable users to recover/repurpose the MTP functionality for faster outputs.

Hey everyone. While I was trying to use Gemma 4 through the LiteRT API in my Android app, I noticed that Gemma 4 threw errors when loading on my Google Pixel 9 test device about the "mtp weights being an incompatible tensor shape". I did some digging and found that there are additional MTP heads within the LiteRT files, intended for speculative decoding and much faster outputs.

Well, it turns out I got confirmation today from a Google employee that Gemma 4 DOES INDEED have MTP, but it was "removed on purpose" for "ensuring compatibility and broad usability".

Honestly, it would've been great if they released the full model instead, considering we also never got the Gemma 124B model that was accidentally leaked in Jeff Dean's tweet. Much faster Gemma 4 generation would have been great to have, ideally on top of the already fast MoE. Maybe someone can reverse engineer the tensors and the math based on the compute graph in LiteRT and extract the MTP functionality?
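For readers unfamiliar with why MTP heads speed things up: they let the model cheaply draft several tokens ahead, which a single full forward pass then verifies, so each expensive pass can yield more than one token. Below is a minimal toy sketch of that draft-then-verify loop. The functions `main_model` and `mtp_draft` are illustrative stubs standing in for the full model and the MTP heads; none of this reflects Gemma 4's or LiteRT's actual internals.

```python
# Toy sketch of MTP-style speculative decoding over integer "tokens".
# main_model = expensive full forward pass; mtp_draft = cheap MTP heads.
# Both are hypothetical stubs, not Gemma 4 / LiteRT internals.

def main_model(context):
    """Full forward pass: returns the 'true' next token (toy rule)."""
    return (sum(context) + 1) % 10

def mtp_draft(context, k):
    """MTP heads: cheaply draft k tokens ahead in one shot.
    This toy draft is correct except for a simulated error at step 3."""
    draft, ctx = [], list(context)
    for i in range(k):
        guess = (sum(ctx) + 1) % 10
        if i == 2:                      # simulate a draft mistake
            guess = (guess + 5) % 10
        draft.append(guess)
        ctx.append(guess)
    return draft

def speculative_decode(context, k):
    """Draft k tokens with the MTP heads, verify them against the main
    model, and accept the longest correct prefix; on the first mismatch,
    keep the main model's token instead. Every verification round thus
    yields at least one token, and up to k when the draft is right."""
    draft = mtp_draft(context, k)
    accepted, ctx = [], list(context)
    for guess in draft:
        true_tok = main_model(ctx)      # in practice: one batched pass
        accepted.append(true_tok)
        ctx.append(true_tok)
        if guess != true_tok:
            break                       # reject the rest of the draft
    return accepted

print(speculative_decode([1, 2, 3], k=4))  # → [7, 4, 8]
```

Here the first two drafted tokens match and are accepted for free, the third is rejected and replaced by the main model's output, so three tokens come out of what would otherwise be three separate full decode steps started from scratch.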

Here's a link to the conversation:

https://huggingface.co/google/gemma-4-E4B-it/discussions/5

submitted by /u/Electrical-Monitor27