Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
arXiv cs.CL / 4/9/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper benchmarks seven recent reasoning-focused, instruction-tuned LLMs (both dense and MoE) on ARC-Challenge, GSM8K, MATH Levels 1–3, and TruthfulQA MC1, under zero-shot, chain-of-thought (CoT), and few-shot CoT prompting.
- Results show that the real end-to-end accuracy–efficiency tradeoff depends on the combined effect of model architecture and prompting strategy, with Gemma-4-E4B achieving the best overall weighted accuracy (0.675) alongside relatively low VRAM usage (14.9 GB).
- Although MoE models are expected to be more parameter- and compute-efficient, the study finds that sparse activation alone does not guarantee the best practical operating point: accuracy and resource usage vary substantially by model and setting.
- Task-level trends differ by family: Gemma models lead on ARC and MATH, Phi models are strongest on TruthfulQA, and GSM8K is highly sensitive to prompting (including a sharp drop for Phi-4-reasoning under few-shot CoT).
- The authors release a reproducible benchmark pipeline, aggregated results, and statistical analyses intended to support deployment-oriented evaluation under realistic constraints like latency and GPU memory limits.
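The "overall weighted accuracy" headline number above aggregates per-task accuracies weighted by each task's example count. A minimal sketch of that aggregation, with hypothetical accuracy values (the per-task counts shown are the public test-set sizes for ARC-Challenge, GSM8K, and TruthfulQA; the paper's exact weighting scheme and numbers may differ):

```python
def weighted_accuracy(results):
    """Combine per-task accuracies, weighted by example count.

    results: dict mapping task name -> (accuracy, n_examples).
    Returns the size-weighted mean accuracy across tasks.
    """
    total = sum(n for _, n in results.values())
    return sum(acc * n for acc, n in results.values()) / total

# Hypothetical per-task accuracies for one model; counts are public
# test-set sizes, not figures taken from the paper.
example = {
    "ARC-Challenge":  (0.70, 1172),
    "GSM8K":          (0.65, 1319),
    "TruthfulQA MC1": (0.55, 817),
}
print(round(weighted_accuracy(example), 3))
```

Weighting by example count keeps a small benchmark from dominating the headline number; a simple unweighted mean over tasks would instead give each benchmark equal influence regardless of size.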