I created an LLM benchmark and I still can't believe how good Qwen3.5-122b performed

Reddit r/LocalLLaMA / 3/27/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • A developer describes building a custom, months-long benchmark for LLMs using a text-based strategy game where models repeatedly plan, produce, trade, and adapt under heavy incoming damage.
  • The benchmark includes an explicit memory/reflection loop: each LLM reformulates prompts based on observed outcomes and this self-critique/re-adaptation occurs about 20 times per game.
  • The author reports standout performance from Qwen3.5-122B when run with AWQ 4-bit quantization, stating they were surprised by how well it performs.
  • To improve fairness across vendors, the benchmark attempts to equalize compute/“reasoning time” so models with different default runtimes are compared under similar generation budgets.
  • The post notes concerns about brute-force behavior (e.g., “spawning a parliament of noise”) as a drawback of giving some models disproportionately long reasoning/output generation.

I've been working on this game for 2 months, literally all my time (the last time I went out of the apartment was March 1st).
It's a text-based strategy game with a massive amount of incoming damage on both LLM sides. Each LLM controls 4 small "countries", one of which is the Sovereign (the most important). The LLMs decide what to build, what to train, what to produce, what to trade, what to cast, and what to prioritize. There is also a memory system: after examining the damage done to them, as well as what they inflicted on the enemy, they self-form a new prompt. It truly measures whether they're able to self-criticize and quickly change/adapt. This reflection happens over 20 times for each LLM per game.
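The memory loop described above can be sketched roughly like this. Everything here (function names, the prompt wording, the damage bookkeeping) is a hypothetical reconstruction from the post's description, not the author's actual code:

```python
# Hypothetical sketch of the self-reflection/memory loop: after each
# round, the model examines the damage it dealt and took, critiques its
# own plan, and rewrites its strategy prompt for the next round.

def reflection_loop(ask_llm, resolve_round, base_prompt, n_reflections=20):
    """Run ~20 reflection cycles; the rewritten prompt drives each next turn.

    ask_llm(prompt) -> str       : one LLM call (assumed interface)
    resolve_round(orders) ->     : (damage_dealt, damage_taken) for the turn
    """
    prompt = base_prompt
    history = []
    for _ in range(n_reflections):
        orders = ask_llm(prompt)               # decide what to build/train/trade/cast
        dealt, taken = resolve_round(orders)   # game engine resolves the turn
        history.append((dealt, taken))
        # Self-critique step: the model re-forms its own prompt from outcomes.
        prompt = ask_llm(
            f"{base_prompt}\nLast round: dealt {dealt} damage, took {taken}. "
            "Critique your previous plan and write an improved strategy prompt."
        )
    return prompt, history
```

With stand-in callables for the LLM and the game engine, the loop just threads each round's outcome back into the next prompt; the interesting part in the real benchmark is whether the model's rewritten prompt actually changes its behavior.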
You can read more about it on the website, there are detailed match reports.
As a last mention, I honestly can't get over how good Qwen3.5 122B is (used here with an AWQ 4-bit quant)... Just... WOW.
Thank you for reading!
https://dominionrift.ai

PS - Before you ask, the last two matches are being played right now and the full scores will be up soon.
I'm very tired and probably missing a lot of points. One example: I focused on giving each LLM roughly 60 seconds of reasoning time per turn, because I initially noticed that at the same reasoning level, different LLM vendors take 3-4x, sometimes 5x, as long to generate an answer. I started with reasoning set to high for all of them, and chatGPT5.4 took over 10 minutes per turn while Opus was under 2 minutes, which didn't seem fair. A big part of the work was figuring out how to make them compute roughly the same amount.
Spawning a parliament of noise just for a few hundred output tokens doesn't seem intelligent; it seems a lot more like brute-forcing.

submitted by /u/UltrMgns