[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on Blackwell

Reddit r/MachineLearning / 4/3/2026

📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • Google DeepMind released Gemma 4, including a 31B dense model and an MoE-based 26B A4B model, both supporting up to 256K context and native multimodal input (text, image, video, dynamic resolution).
  • The post claims Gemma 4 runs on NVIDIA B200 and AMD MI355X “from the same inference stack,” suggesting portability across major GPU/accelerator platforms.
  • On NVIDIA B200, the author reports roughly 15% higher output throughput than vLLM, indicating potential gains for high-throughput inference setups.
  • A free Modular playground is offered so users can test Gemma 4 without deploying infrastructure themselves.

Google DeepMind dropped Gemma 4 today:

Gemma 4 31B: dense, 256K context, redesigned architecture targeting efficiency and long-context quality

Gemma 4 26B A4B: MoE, 26B total / 4B active per forward pass, 256K context

Both are natively multimodal (text, image, video, dynamic resolution).
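
For context on the A4B naming: in an MoE model only a subset of experts fires per token, so inference compute tracks active rather than total parameters. A minimal back-of-the-envelope sketch, using the standard ~2 FLOPs per parameter per generated token approximation (the exact Gemma 4 expert layout isn't given in the post, so treat this as illustrative only):

```python
# Rough inference-cost comparison: dense 31B vs. MoE 26B with 4B active.
# Uses the common ~2 FLOPs per parameter per token approximation for a
# forward pass; real numbers depend on architecture details not stated here.

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per generated token: ~2 * active parameters."""
    return 2 * active_params

dense_31b = flops_per_token(31e9)  # all 31B parameters fire on every token
moe_a4b = flops_per_token(4e9)     # only ~4B of the 26B parameters fire per token

print(f"dense 31B : {dense_31b:.2e} FLOPs/token")
print(f"MoE A4B   : {moe_a4b:.2e} FLOPs/token")
print(f"ratio     : {dense_31b / moe_a4b:.1f}x less compute per token for the MoE")
```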

We got both running on MAX on launch day across NVIDIA B200 and AMD MI355X from the same stack. On B200 we're seeing 15% higher output throughput vs. vLLM (happy to share more on methodology if useful).
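
For anyone unsure what "output throughput" means in these comparisons: generated tokens per second across concurrent requests. Below is a minimal measurement sketch against an OpenAI-compatible /v1/chat/completions endpoint; both MAX and vLLM expose OpenAI-compatible serving APIs, so the same harness can hit either. The BASE_URL and MODEL values are placeholders, and this is not the author's actual methodology (real benchmarks also control prompt lengths, warmup, and request arrival patterns):

```python
# Minimal output-throughput harness for an OpenAI-compatible chat endpoint.
# BASE_URL and MODEL are placeholders -- point them at whichever server you
# are benchmarking and run the same request mix against each.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "gemma-4-26b-a4b"                               # placeholder model id
CONCURRENCY = 32
REQUESTS = 128

def one_request(_: int) -> int:
    resp = requests.post(BASE_URL, json={
        "model": MODEL,
        "messages": [{"role": "user",
                      "content": "Summarize mixture-of-experts in 100 words."}],
        "max_tokens": 256,
    })
    resp.raise_for_status()
    # OpenAI-compatible servers report generated-token counts in usage.
    return resp.json()["usage"]["completion_tokens"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    completion_tokens = sum(pool.map(one_request, range(REQUESTS)))
elapsed = time.perf_counter() - start

print(f"output throughput: {completion_tokens / elapsed:.1f} tok/s "
      f"({completion_tokens} tokens over {elapsed:.1f}s, {CONCURRENCY} concurrent)")
```

Running an identical harness and request mix against both servers is what makes a "15% vs. vLLM" figure comparable at all, which is presumably what the author's offered methodology would detail.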

Free playground if you want to test without spinning anything up: https://www.modular.com/#playground

submitted by /u/carolinedfrasca