Thought I'd share some benchmark numbers from my local setup.
Hardware: Dual NVIDIA RTX 3090s
Model: Gemma 4 (MoE architecture)
Performance: ~120 tokens per second
The efficiency of this MoE implementation is unreal. Even under heavy load, the throughput stays remarkably consistent. It's a massive upgrade for anyone running local LLMs for high-frequency tasks or complex agentic workflows.
The speed allows for near-instantaneous reasoning, which is a total paradigm shift compared to older dense models. If you have the VRAM to spare, this is definitely the way to go.
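In case anyone wants to sanity-check their own numbers, here's a minimal sketch of how tokens-per-second can be measured against any streaming generator. The `fake_stream` function is just a stand-in for a real model's token stream (not an actual API call), so swap in whatever your inference server exposes:

```python
import time

def measure_tps(token_stream):
    """Consume a token iterator and return throughput in tokens/sec."""
    start = time.perf_counter()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

def fake_stream(n=120, delay=0.001):
    # Placeholder generator simulating a model emitting n tokens;
    # replace with your actual streaming call.
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

print(f"{measure_tps(fake_stream()):.1f} tok/s")
```

Timing the whole stream (rather than per-token) smooths over scheduler jitter, which matters when comparing MoE vs. dense models at these speeds.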




