Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

arXiv cs.CL / 4/9/2026


Key Points

  • The paper benchmarks seven recent reasoning-focused instruction-tuned LLMs (dense and MoE) across ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 using zero-shot, chain-of-thought, and few-shot chain-of-thought prompting.
  • Results show that real end-to-end accuracy–efficiency tradeoffs depend on the combined effects of model architecture and prompting, with Gemma-4-E4B achieving the best overall weighted accuracy (0.675) alongside relatively low VRAM (14.9 GB).
  • Although MoE models are expected to be more parameter/compute efficient, the study finds sparse activation alone does not ensure the best practical operating point, as accuracy and resource usage vary substantially by model and setting.
  • Task-level performance trends differ by family: Gemma models lead on ARC and Math, Phi models are strongest on TruthfulQA, and GSM8K is highly sensitive to prompting (including a sharp Phi-4-reasoning drop under few-shot CoT).
  • The authors release a reproducible benchmark pipeline, aggregated results, and statistical analyses intended to support deployment-oriented evaluation under realistic constraints like latency and GPU memory limits.
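The "weighted accuracy" headline numbers above aggregate per-task scores into one summary. The paper's exact weighting scheme is not given here, but a common choice is to weight each task by its number of evaluated examples; a minimal sketch under that assumption (function name and all figures are illustrative, not the paper's):

```python
def weighted_accuracy(task_acc: dict[str, float], weights: dict[str, int]) -> float:
    """Aggregate per-task accuracies into one score, weighting each task
    by its number of evaluated examples (one plausible scheme; the paper
    may weight differently, e.g. uniformly across tasks)."""
    total = sum(weights.values())
    return sum(task_acc[t] * weights[t] for t in task_acc) / total

# Illustrative accuracies and test-set sizes only, not the paper's data:
acc = {"arc": 0.80, "gsm8k": 0.70, "math": 0.55, "truthfulqa": 0.60}
n = {"arc": 1172, "gsm8k": 1319, "math": 1000, "truthfulqa": 817}
score = weighted_accuracy(acc, n)
```

With equal weights this reduces to a plain macro-average; size-weighting instead lets large benchmarks like GSM8K dominate, which is one reason the choice of weighting can shift which model "wins" overall.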

Abstract

Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks -- ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 -- under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.
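The abstract's "approximate FLOPs-per-token proxy" is not specified in detail, but a standard rough estimate for decoder inference is ~2 FLOPs per *active* parameter per token, which is also where MoE's expected advantage comes from: only the routed top-k experts count toward the active total. A hedged sketch of that accounting (these helpers and parameter splits are assumptions for illustration, not the paper's method):

```python
def flops_per_token(active_params: float) -> float:
    """Rough decode-time proxy: ~2 FLOPs per active parameter per token
    (one multiply + one add per weight). Ignores attention-over-KV-cache
    cost, which grows with sequence length."""
    return 2.0 * active_params

def moe_active_params(shared_params: float, params_per_expert: float, top_k: int) -> float:
    """Active parameters per token in an MoE model: always-on shared
    weights (attention, embeddings, router) plus only the top-k routed
    experts actually executed for that token."""
    return shared_params + top_k * params_per_expert

# Illustrative: a model with 27B total params but only ~4B active per token
# (hypothetical split, loosely in the spirit of a "26B-A4B" naming scheme)
# costs roughly the same per-token compute as a ~4B dense model.
active = moe_active_params(shared_params=2e9, params_per_expert=1e9, top_k=2)
proxy = flops_per_token(active)
```

This is exactly why sparse activation predicts good compute efficiency on paper, while the measured VRAM gap in the results (14.9 GB vs. 48.1 GB) shows the catch: all experts must still reside in GPU memory, so the FLOPs proxy and the memory footprint tell different stories.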