I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality

Reddit r/LocalLLaMA / 4/19/2026


Key Points

  • The article describes an open-source, tool-using “agentic” tabletop GM project that can run across different LLM backends and game systems, originally built from a Claude Code skill and generalized afterward.
  • The author tests eight LLMs with a custom “narrative quality probe” focused on generating the atmosphere and narration you’d want to play, arguing that tool-call compliance alone is insufficient.
  • Results suggest a ~27B model delivers better narrative quality than a much larger 405B model, indicating that higher parameter counts do not necessarily produce more playable storytelling.
  • The author finds that reliable local inference for multi-step tool chaining is difficult below roughly “70B+ on 64GB+ RAM,” with smaller models (e.g., ~24B on a MacBook Air) drifting attention after several sequential tool calls.
  • Practical takeaways include using stronger local hardware for agentic workflows or routing via services like OpenRouter, alongside documentation of prompt/routing changes that improve performance (e.g., reducing standing prompt by ~87%).

Background:

I've been working on an open-source agentic tabletop GM as a leisure project, intended to run on any LLM with tool support. It started as a Claude Code skill for running D&D sessions; after wanting to see how it felt on different backends, I generalized it to be model-agnostic and game-system-agnostic. Rest assured, D&D purists flamed it immediately over the AI integration. I set their dimness aside, since my purpose is to introduce my family to fantasy RPGs, and it's worked wonderfully.

After spending some time on instruction-following benchmarks and local model testing, I had a more interesting question: which model actually writes narration you'd want to play in? Tool-call compliance is table stakes. I wanted to know which one gives you atmosphere.

So I built a narrative quality probe and ran it against 8 models. Here's what I found.

More Context (get it?): why this matters for agentic LLM tools

open-tabletop-gm (I know, -4 creativity) is less chatbot wrapper and more agentic workflow - the model has to chain 4–6 tool calls (bash, file reads) before delivering its first narration turn. /gm load alone requires a display check + 3 file reads before the opening scene. This is where smaller local models tend to fall apart.
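To make the shape of that startup chain concrete, here is a minimal sketch of a /gm load-style flow. Everything in it (the `Tool` wrapper, the specific file names, the command used for the display check) is hypothetical and only illustrates the kind of 4–6-step tool sequence the model must complete before its first narration turn:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[[str], str]

def run_gm_load(tools: dict[str, Tool], campaign: str) -> list[str]:
    """Hypothetical /gm load flow: one display check plus three
    file reads must all complete before the model may narrate."""
    transcript = []
    steps = [
        ("bash", "tput cols"),                     # display check
        ("read_file", f"{campaign}/campaign.md"),  # campaign summary
        ("read_file", f"{campaign}/npcs.md"),      # NPC roster
        ("read_file", f"{campaign}/state.md"),     # session state
    ]
    for tool_name, arg in steps:
        transcript.append(tools[tool_name].run(arg))
    return transcript

# Stub tools so the sketch is self-contained.
tools = {
    "bash": Tool("bash", lambda cmd: f"$ {cmd} -> 80"),
    "read_file": Tool("read_file", lambda path: f"contents of {path}"),
}
print(run_gm_load(tools, "ashmarket"))
```

The failure mode described below (attention drifting toward the most recently read file) happens between these steps, which is why each one is a fresh chance for a small model to lose the plot.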

I spent a while trying to get Mistral Small 3.1 24B working on a MacBook Air (24GB unified memory). It was... an experience. After 4–5 sequential tool calls, the model's attention drifts from its instruction set back toward the most recently read file. In practice this meant the model would finish reading npcs.md, see an NPC named "Elara Silvermoon," and then attempt to load a campaign called "Elara Silvermoon." I tried 10+ instruction variants. It was architectural, not instructional. I gave up.

The practical threshold for reliable local inference appears to be 70B+ on 64GB+ RAM. On MacBook Air hardware, OpenRouter is just the better path. I documented the routing architecture changes that helped (reduced standing prompt by ~87%) in a separate discussion if you want the full breakdown.

The narrative probe

Once the instruction-following benchmarks were done, I built a second probe specifically for narration quality. Same idea as an instruction-following probe, but the question is: does this model write scenes worth playing in?

The probe sends each model 6 GM scenarios grounded in a shared mini campaign. A rogue named Sable navigating a gritty city called Ashmarket, beneath an ash-spewing volcano called Cinderpeak. Every model gets identical context:

  • scene_entry - describe arriving at the Ashmarket at dusk
  • npc_meeting - introduce Mira, a fixer contact the player is meeting
  • yes_and - player throws ash in a guard's face mid-scene; narrate the consequence
  • consequence - player bribed past a checkpoint last session; open the next scene with fallout
  • pacing - mid-scene tension shift, player realizes they're being followed
  • closing_beat - end the session on a hook that makes the player want to come back
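Rendered as data, the six scenarios might look like this. The prompt strings are paraphrased from the bullets above, not copied from narrative_probe.py, so treat the exact wording as illustrative:

```python
SCENARIOS = {
    "scene_entry":  "Describe Sable arriving at the Ashmarket at dusk.",
    "npc_meeting":  "Introduce Mira, a fixer contact the player is meeting.",
    "yes_and":      "The player throws ash in a guard's face mid-scene; narrate the consequence.",
    "consequence":  "The player bribed past a checkpoint last session; open the next scene with the fallout.",
    "pacing":       "Mid-scene tension shift: the player realizes they're being followed.",
    "closing_beat": "End the session on a hook that makes the player want to come back.",
}

SHARED_CONTEXT = (
    "Campaign: a rogue named Sable navigating Ashmarket, a gritty city "
    "beneath the ash-spewing volcano Cinderpeak."
)

def build_messages(scenario: str) -> list[dict]:
    """Every model gets the identical shared context plus one scenario prompt."""
    return [
        {"role": "system", "content": SHARED_CONTEXT},
        {"role": "user", "content": SCENARIOS[scenario]},
    ]
```

Keeping the context identical across models is what makes the per-model comparisons below meaningful.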

Each response gets auto-scored on 8 dimensions (sensory density, forward momentum, NPC voice markers, response length, etc.) and then passed to a lightweight LLM judge (GPT-OSS-20B via OpenRouter) for 1–5 scores on:

  • atmosphere - sensory detail, tone, immersion
  • npc_craft - NPC voice distinctiveness, characterization
  • gm_craft - pacing, forward momentum, scene management
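A toy version of the auto-scoring pass might look like the following. The sensory word list, length bounds, and the collapse to a PASS/WARN/FAIL verdict are all made up for illustration; the real probe's dimensions and thresholds will differ:

```python
import re

# Illustrative sensory vocabulary, not the probe's actual list.
SENSORY = {"ash", "smoke", "lantern", "dusk", "scent", "din", "sting", "drift"}

def auto_score(text: str) -> dict:
    words = re.findall(r"[a-z']+", text.lower())
    sensory_density = sum(w in SENSORY for w in words) / max(len(words), 1)
    length_ok = 40 <= len(words) <= 250            # illustrative bounds
    ends_on_prompt = text.rstrip().endswith("?")   # forward-momentum proxy
    score = {
        "sensory_density": round(sensory_density, 3),
        "length_ok": length_ok,
        "ends_on_prompt": ends_on_prompt,
    }
    # Collapse to the P/W/F verdict style used in the results table.
    hits = sum((sensory_density > 0.02, length_ok, ends_on_prompt))
    score["verdict"] = {3: "PASS", 2: "WARN"}.get(hits, "FAIL")
    return score
```

Checks like these are cheap and deterministic, which is why they pair well with a judge model that handles the subjective dimensions.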

Total cost for the full 8-model run including all judge calls: ~$0.02.

(Note: GPT-OSS-20B is a reasoning model. If you use it as a judge, set max_tokens=300 or it'll burn all its tokens on internal reasoning and return null content. Ask me how I know.)

Results!

Model                        Auto (P/W/F)   Atmosphere   NPC Craft   GM Craft   Overall
google/gemma-3-27b-it        P:4 W:1 F:1    4.0          4.5         4.5        4.33
google/gemma-4-31b-it        P:2 W:3 F:1    4.0          4.0         4.0        4.0
minimax/minimax-m2.5         P:0 W:4 F:2    4.0          4.0         4.0        4.0
qwen/qwen3-next-80b-a3b      P:0 W:3 F:3    4.0          4.0         4.0        4.0
nvidia/nemotron-nano-30b     P:1 W:2 F:3    4.5          3.0         4.0        3.83
qwen/qwen3-coder             P:3 W:2 F:1    4.0          3.0         4.0        3.67
meta-llama/llama-3.3-70b     P:2 W:2 F:2    4.0          3.0         4.0        3.67
nousresearch/hermes-3-405b   P:2 W:4 F:0    4.0          3.0         4.0        3.67
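The Overall column is consistent with a plain mean of the three judge scores, rounded to two places (an assumption; the probe may weight the dimensions differently):

```python
def overall(atmosphere: float, npc_craft: float, gm_craft: float) -> float:
    # Unweighted mean of the three judge dimensions.
    return round((atmosphere + npc_craft + gm_craft) / 3, 2)

print(overall(4.0, 4.5, 4.5))  # Gemma 3 27B -> 4.33
```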

Highlight reel: same prompt, 8 different GMs

Prompt: The player's rogue, Sable, arrives at the Ashmarket at dusk.

Gemma 3 27B (winner):

A dozen pairs of eyes flick over you – quickly, discreetly.

MiniMax M2.5:

Hawkers shout overlapping prices for salt fish, stolen glass, cures for ailments no one admits to having.

Qwen3-80B:

You hear it then—a soft, wet click.

Nemotron Nano 30B:

The ash drifts down like gray snow, catching in the lantern light and settling on the backs of the market stalls.

Llama 3.3 70B:

The air is thick with the smell of smoke, sweat, and the distant tang of ash from the Cinderpeak volcano.

NPC introduction: same character, different voices:

Gemma 3 27B: A faint scent of cloves precedes her, clinging to the air.
MiniMax M2.5: She doesn't turn as you approach, but her voice cuts through the market din: "Three weeks late for a debrief, courier."
Qwen3-80B: Her eyes are the color of old bruises.
Qwen3-coder (a code model, for context): The acrid smoke from a nearby roasting pit stings your eyes as you weave between stalls.

What it means

Gemma 3 27B is the headline. A 27B model beat Hermes 405B and matched the larger Gemma 4 31B. It got the most clean auto-passes (4), and the judge gave it 4.5 on both NPC craft and GM craft, making it the only model to crack 4.5 on more than one dimension in the run. For local inference this is interesting: if you have the VRAM for a 27B, the narration quality is competitive with models 15x its size.

Bigger isn't better for narration quality. Hermes 405B had 0 auto-FAILs. It was the most disciplined model in the run but its writing was safe rather than vivid. 405B bought consistency, not voice. If you're running it locally for the compliance properties, great. If you want atmosphere, there are better options at a fraction of the weight.

Nemotron Nano 30B scored the highest atmosphere (4.5) in the whole run, and its scene-setting sentences were genuinely cinematic. NPC craft suffered (3.0) and dialogue felt thin, but as a pure scene-painter it outscored everything else. Interesting for a 30B nano model.

Auto scores and judge scores can tell different stories. MiniMax had 0 auto-passes but a 4.0 judge average. Its writing quality was high and the judge noticed but it violated structural discipline rules (length, pacing beats). The auto-scorer catches whether a model follows GM conventions; the judge catches whether it can write. Both matter.

Qwen3-coder wrote acceptable narration. This surprised me more than the Gemma result.

The probe is open source

narrative_probe.py is standalone: point it at any OpenAI-compatible endpoint, give it a judge model, and it runs. All 8 result JSONs are in the repo. If you want to add a model to the comparison, run-narrative.sh handles the full run.

probe/ + full results (including response samples for each)

If you're curious about the broader project - it started as a Claude Code family D&D thing (r/ClaudeAI post) and grew from there. The local model findings and routing architecture are in this GitHub Discussion if you want the longer version.

Happy to answer questions about the probe design, the local inference findings, or how the GM routing architecture works.

[Edit - corrected formatting]

submitted by /u/Bobby_Gray