You can play them here: https://fatheredpuma81.github.io/LLM_Racing_Games/

This started out as a simple test of Qwen3 Coder Next vs. Qwen3.5 4B, because they have similar benchmark numbers. Then I just kept trying other models and decided I might as well share it, even if I'm not that happy with how I did it.

Read "How this works" in the top right if you want to know how it was done, but the TL;DR is: I disabled vision, sent the same initial prompt in Plan mode, enabled Playwright MCP and sent the same start prompt, and then spent 3 turns testing the games and pointing out the issues I saw to the LLMs.

There's a ton of things I'd do differently if I ever got around to redoing this: keeping and showing all 4 versions of the HTML, for one, and not disabling vision, which hindered Qwen 27B a ton (it was only disabled for an apples-to-apples comparison between 4B and Coder). I had a bunch more thoughts on it, but I'm too tired to remember them.

Some interesting notes:
(Interactive) OpenCode Racing Game Comparison: Qwen3.6 35B vs Qwen3.5 122B vs Qwen3.5 27B vs Qwen3.5 4B vs Gemma 4 31B vs Gemma 4 26B vs Qwen3 Coder Next vs GLM 4.7 Flash
Reddit r/LocalLLaMA / 4/21/2026
💬 Opinion | Ideas & Deep Analysis | Tools & Practical Usage
Key Points
- The interactive page lets users play a set of LLM-driven “racing game” implementations produced by different models, comparing how they generate and modify the game code.
- The creator explains the methodology: vision is disabled, the same initial prompt is sent in Plan mode, Playwright MCP is enabled and the same start prompt is sent, and then three turns are spent testing the games and pointing out the observed issues to the models.
- Notable behavior differences include Qwen3 Coder Next seemingly using invisible-wall tracks, Gemma 4 31B and Qwen3.5 27B outputting full code each turn, and Qwen3.5 27B accidentally succeeding on the last turn due to disabling Playwright MCP.
- Other observations highlight unique features per model, such as Gemma 4 26B adding sound and spawning a subagent, and GLM 4.7 Flash using a subagent during planning.
- The write-up also lists what the creator would do differently: leaving vision enabled (disabling it hindered Qwen 27B in particular) and keeping and showing all 4 HTML versions per model for easier comparison.