I made a 'benchmark' where LLMs write code controlling units in a 1v1 RTS game.

Dev.to / 3/23/2026

📰 News · Tools & Practical Usage · Models & Research

Key Points

  • The benchmark tasks LLMs with writing code to control RTS game units, where only move() and pew() actions are available, emphasizing strategic reasoning in a 9v9 setting.
  • The testing method includes a baseline phase of 10 rounds against a human-coded bot, followed by a 10-game round-robin tournament with iterative improvements and reviews that include ASCII game-state snapshots and model-generated logs.
  • Gemini 3.1 emerges as the best performer on this benchmark, with results and replays hosted at the linked arena page.
  • The evaluation approach focuses on reproducibility and interpretability through code generation, gameplay, and console/ASCII-state logging to compare model behavior.

Link to the results and additional details: https://yare.io/ai-arena

The game is fairly simple: 9 vs. 9 units battling each other on a basic map. The only actions a unit can take are move() and pew(). All of the complexity emerges from having to reason about where to move and whom to pew.
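To make the two-action design concrete, here is a minimal sketch of the kind of per-unit decision logic a model might write. The article only names move() and pew(); the unit shape ({x, y, hp, range}), the decide() helper, and the return format are my assumptions, not the benchmark's actual API.

```javascript
// Euclidean distance between two positioned objects.
function dist(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Decide one unit's action for this tick: pew the nearest living enemy
// if it is in range, otherwise move toward it. (Hypothetical sketch --
// real bots would call unit.move()/unit.pew() with these targets.)
function decide(unit, enemies) {
  const alive = enemies.filter(e => e.hp > 0);
  if (alive.length === 0) return { action: "idle" };
  const nearest = alive.reduce((a, b) =>
    dist(unit, a) <= dist(unit, b) ? a : b
  );
  if (dist(unit, nearest) <= unit.range) {
    return { action: "pew", target: nearest };
  }
  return { action: "move", target: nearest };
}
```

Even at this size, the interesting choices (focus fire, kiting, retreating when outnumbered) live entirely in how decide() picks and approaches targets, which is what the benchmark is probing.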

Testing method

Every LLM first creates its 'baseline' bot by playing 10 rounds against a human-coded bot of decent strength. A round consists of:

  • writing code based on the game's documentation
  • playing a game (models are allowed to add console.log() calls for whatever they think is important to track)
  • getting a review of the finished game (an ASCII representation of the game state at key moments, plus the logs they themselves coded in)
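The review step's ASCII snapshot can be imagined as something like the renderer below. The grid size, glyphs, and function name are assumptions for illustration; the benchmark's actual snapshot format isn't specified in the article.

```javascript
// Render a hypothetical game-state snapshot: "M" for my units, "E" for
// enemy units, "." for empty cells. Positions are assumed to be integer
// grid coordinates for the sake of the sketch.
function renderAscii(width, height, mine, theirs) {
  const grid = Array.from({ length: height }, () => Array(width).fill("."));
  for (const u of mine) grid[u.y][u.x] = "M";
  for (const u of theirs) grid[u.y][u.x] = "E";
  return grid.map(row => row.join("")).join("\n");
}
```

Feeding the model a few of these frames alongside its own console.log() output is what lets it connect its code's decisions to what actually happened on the map.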

Once their baseline bots are ready, the models play a 10-game round-robin tournament against each other, using the same iterative loop (improving their bot after every game).
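The round-robin structure can be sketched as a simple pairing generator: every model is matched against every other model once per cycle. The model names below are placeholders, and the benchmark's actual scheduler isn't published.

```javascript
// Generate all unordered pairings for one round-robin cycle.
// For n models this yields n * (n - 1) / 2 matches.
function roundRobinPairs(models) {
  const pairs = [];
  for (let i = 0; i < models.length; i++) {
    for (let j = i + 1; j < models.length; j++) {
      pairs.push([models[i], models[j]]);
    }
  }
  return pairs;
}
```

With the iterative loop layered on top, each model revises its bot between matches, so later pairings face stronger opponents than earlier ones.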

The results

Gemini 3.1 is by far the best at this specific benchmark/game. See the replays and additional details at https://yare.io/ai-arena