I tested 9 local models on the same flight sim prompt, all Q8, different Q providers, MLX

Reddit r/LocalLLaMA / 4/22/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The author benchmarked nine local LLMs with a single, repeatable flight-combat simulator prompt on an M3 Max (MLX, driven through Claude Code), scoring "prompts-to-final" plus whether the resulting game actually played well.
  • The study found that quantization provider/implementation can matter more than bit-width: three different 8-bit quant variants of the same Qwen3.6 35B produced meaningfully different gameplay outcomes and debugging difficulty.
  • The results showed little correlation between total output line count and quality, with the best game delivered in fewer prompts and lines than the worst case.
  • Only one model (Qwopus 3.5 27B) implemented more authentic flight physics and procedural audio, suggesting distillation/training choices can dominate over raw parameter size.
  • Overall, the experiment challenges assumptions that “8-bit is good enough” or that larger parameter counts automatically yield better in-game performance, highlighting practical differences in tooling and quant pipelines.

I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count.

All 8-bit MLX, M3 Max 128GB, served via oMLX, prompted through Claude Code. Same prompt every time — single-file HTML, three selectable planes (jet, prop, wildcard of the model's choice), dynamic enemies, tracers, damage, crash spiral on loss. Counted prompts-to-final and graded on "does it actually play."

https://alextzk.github.io/flight-combat-llm-comp/ <- You can play the games here

The lineup:

  • Gemma 31B dense unsloth
  • Gemma 4 26B a4b unsloth
  • Qwen3.5 27B dense
  • Qwen3.5 35B A3B MoE
  • Qwen3.6 35B A3B in three different quants (oMLX, Unsloth, MLX Community)
  • Qwen3 Coder Next 80B
  • Qwopus 3.5 27B

Surprising findings:

1. Quant provider matters more than bit width. Three 8-bit quants of the exact same Qwen3.6 35B produced three meaningfully different games. Unsloth nailed it in 3 prompts (1,304 lines, working minimap, round planet, the model reviewed its own code for bugs before I pressed enter). MLX Community was fine in 4. oMLX was a 5-prompt debugging slog where the controls rubberbanded back to neutral and the model couldn't figure out why after three attempts. Same base model. Same 8-bit but different UX. "It's 8-bit" is not a sufficient description of a quant.
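For what it's worth, the oMLX rubberbanding reads like a classic input-decay bug. A hypothetical minimal sketch (these names are mine, not from the generated code): a per-frame damping step is applied after the stick input, so a held control settles below full deflection and snaps back to neutral the moment you let go.

```javascript
// Hypothetical sketch of a "rubberband to neutral" control bug:
// damping is applied every frame after the input, so a held stick
// converges to an equilibrium below full deflection, and decays
// straight back to neutral when released.
function stepControl(value, input, damping) {
  value += input;           // apply this frame's stick input
  return value * damping;   // decay toward neutral every frame
}

let pitch = 0;
for (let i = 0; i < 300; i++) pitch = stepControl(pitch, 0.1, 0.9);
// held input converges to 0.9, never 1.0; with input = 0 it decays to 0
```

Fixing it means damping only the *unheld* axes (or lerping toward the commanded deflection instead of toward zero), which is the kind of thing a model can thrash on for several prompts if it never prints the intermediate values.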

2. Line count is basically uncorrelated with quality. The winner (Qwopus 3.5 27B) shipped in 2 prompts at 1,049 lines. The loser (Qwen Coder Next 80B) shipped in 3 prompts at 1,635 lines — the most code of anyone — with over-sensitive camera, no enemies, and planes rotated 180°. The 80B sibling generated 3× the code of Gemma 31B dense and shipped a worse game.

3. Qwopus was the only model that implemented actual flight physics. Nobody asked for it. It just did it — integrated thrust/drag with per-plane aerodynamic constants, per-frame velocity damping, the F-16 accelerates differently than the Mustang because the constants are different. Also the only one that shipped procedural audio (engine frequency modulated by airspeed ratio). 2 prompts. I have to assume this is the Opus distillation doing real work, because the vanilla Qwen3.5 27B dense — same base — shipped the worst game in the lineup (control loop mixing quaternion rotations with direct Euler writes in the same frame, plane spun like a blender while falling out of the sky). The controls are far from perfect but the way it implemented it and the other extra features it built are second to none.
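The quaternion-vs-Euler clash is worth spelling out. A hypothetical minimal reproduction (not the model's actual code): one path integrates rotation incrementally, another writes the Euler angle directly in the same frame, so one control authority silently clobbers the other and the orientation jumps between two regimes every frame.

```javascript
// Hypothetical reproduction of mixing two rotation authorities in one
// update: an incremental (quaternion-style) integration is overwritten
// by a direct Euler write in the same frame, so the integrated motion
// is discarded and the craft snaps between the two regimes.
function buggyFrame(state, stickPitch) {
  state.pitch += state.pitchRate;   // path 1: integrate angular velocity
  state.pitch = stickPitch * 0.5;   // path 2: direct Euler write wins
  return state;
}

const s = { pitch: 0, pitchRate: 0.1 };
buggyFrame(s, 1.0);  // the integrated increment is lost; pitch is just 0.5
```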

Web Audio engine with pitch modulated by airspeed ratio:

function updateEngineSound(speedRatio) {
  engineOsc.frequency.setValueAtTime(80 + speedRatio * 120, audioCtx.currentTime);
}

// From the F-16 config: per-plane aerodynamic constants
speed: 1200, turnRate: 0.015, climbRate: 0.008, thrust: 0.02, drag: 0.001,

// In the update loop
this.velocity.add(forward.multiplyScalar(this.stats.thrust * 1000 * delta));
this.velocity.multiplyScalar(1 - this.stats.drag);
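One nitpick on the update loop above: `multiplyScalar(1 - drag)` applies the damping once per render frame, so the effective drag depends on frame rate. A hedged sketch of a frame-rate-independent variant (assuming the same `drag` constant was tuned for 60 Hz):

```javascript
// Frame-rate-independent drag: raise the per-frame retention factor to
// the power of elapsed 60 Hz frames, so 30 fps and 120 fps shed the same
// fraction of speed per second of wall-clock time.
function dragFactor(drag, delta) {
  return Math.pow(1 - drag, delta * 60);  // delta in seconds, constant tuned at 60 Hz
}

// usage, in place of the line above:
// this.velocity.multiplyScalar(dragFactor(this.stats.drag, delta));
```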

Other notes worth mentioning:

- Generation speed: Gemma 4 26B a4b was the king at 58.3 tok/s, nearly 2× the Qwen A3B variants and ~7× the dense models. Qwopus generates at under 11 tok/s and still won. Per-token speed is a bad proxy for "time to working artifact."

- Qwen3.6 is a real step up over 3.5. The .1 increment packs more than usual — models reviewing their own output, trying to open the generated HTML in a browser for you. Little things, but they add up.

- The "pick a third plane" wildcard was a surprisingly good creativity probe. Qwen3.6 oMLX picked an AH-64 Apache (technically not a plane, technically the most interesting answer). Qwen Coder Next 80B, the largest model in the lineup, responded to "an option of your choosing" by shipping a third fighter jet.

- The Qwen signature bug: planes rendered 180° rotated. Showed up in most of the Qwen variants.

My personal ranking:

  1. Qwopus 3.5 27B dense
  2. Qwen3.6 35B unsloth
  3. Gemma 4 26B unsloth
  4. Gemma 4 31B unsloth
  5. Qwen3.6 35B mlx-community
  6. Qwen3.5 35B mlx-community
  7. Qwen3.6 35B oMLX oQ quant
  8. Qwen3 Coder Next 80B mlx-community
  9. Qwen3.5 27B mlx-community

If anyone is interested in a more detailed (and punny) write-up with per-model breakdowns and the specific bugs and quirks of each model, there's one on my Medium page, no paywall.

Comments at the top of each HTML file on GitHub record every prompt that was fed back into Claude Code, along with notes.

Happy to dig into any of the specific results in comments. Two follow-ups planned — same 9 models on a 10-bug code review, and a creative task still TBD.

EDIT: added the link for the games at the top.

submitted by /u/StudentDifficult8240