Background

I've been working on an open source agentic tabletop GM as a leisure project intended to run on any LLM with tool support. I started it as a Claude Code skill to run D&D sessions and eventually generalized it to be model-agnostic and game-system-agnostic after wanting to test what it felt like on different backends. Rest assured, D&D purists flamed it immediately because of the AI integration. I set their dimness aside, as my purpose is to introduce my family to fantasy RPGs, and it's worked wonderfully. After spending some time on instruction-following benchmarks and local model testing, I had a more interesting question: which model actually writes narration you'd want to play in? Tool-call compliance is table stakes. I wanted to know which one gives you atmosphere. So I built a narrative quality probe and ran it against 8 models. Here's what I found.

More context (get it?): why this matters for agentic LLM tools

open-tabletop-gm (I know, -4 creativity) is less a chatbot wrapper than an agentic workflow: the model has to chain 4–6 tool calls (bash, file reads) before delivering its first narration turn. I spent a while trying to get Mistral Small 3.1 24B working on a MacBook Air (24GB unified memory). It was... an experience. After 4–5 sequential tool calls, the model's attention drifts from its instruction set back toward the most recently read file. In practice this meant the model would finish reading

The practical threshold for reliable local inference appears to be 70B+ on 64GB+ RAM. On MacBook Air hardware, OpenRouter is just the better path. I documented the routing architecture changes that helped (reduced the standing prompt by ~87%) in a separate discussion if you want the full breakdown.

The narrative probe

Once the instruction-following benchmarks were done, I built a second probe specifically for narration quality. Same idea as an instruction-following probe, but the question is: does this model write scenes worth playing in?
The probe sends each model 6 GM scenarios grounded in a shared mini campaign: a rogue named Sable navigating a gritty city called Ashmarket, beneath an ash-spewing volcano called Cinderpeak. Every model gets identical context.
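The harness loop can be sketched roughly like this. A sketch only: the scenario strings, helper names, and client interface below are illustrative, not the project's actual code; the real probe lives in the repo.

```python
# Illustrative sketch of the probe harness: identical shared context,
# varying scenario, one call per (model, scenario) pair.

SHARED_CONTEXT = (
    "You are the GM. Setting: Ashmarket, a gritty city beneath the "
    "ash-spewing volcano Cinderpeak. The player's rogue is named Sable."
)

SCENARIOS = [
    "Sable arrives at the Ashmarket at dusk.",
    "Sable is introduced to a recurring NPC.",
    # ...the real probe has 6 scenarios
]

def build_messages(scenario: str) -> list[dict]:
    """Every model gets the identical system context; only the scenario varies."""
    return [
        {"role": "system", "content": SHARED_CONTEXT},
        {"role": "user", "content": scenario},
    ]

def run_probe(models: list[str], complete) -> dict[str, list[str]]:
    """`complete(model, messages)` is whatever chat-completion client you use."""
    return {
        model: [complete(model, build_messages(s)) for s in SCENARIOS]
        for model in models
    }
```

The point of `build_messages` is the fairness guarantee: the system context is a constant, so no model sees more or less setting detail than another.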
Each response gets auto-scored on 8 dimensions (sensory density, forward momentum, NPC voice markers, response length, etc.) and then passed to a lightweight LLM judge (GPT-OSS-20B via OpenRouter) for 1–5 scores on:
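A few of those auto-scored dimensions can be approximated with simple heuristics. This is a minimal sketch under assumed rules: the word list, thresholds, and dimension names below are illustrative guesses, not the probe's actual scoring code.

```python
# Illustrative auto-scorer: pass/fail heuristics for a handful of the
# dimensions named above. Word lists and cutoffs are made up for the sketch.
import re

SENSORY_WORDS = {"smoke", "ash", "scent", "glow", "echo", "damp", "flicker"}

def auto_score(response: str) -> dict[str, bool]:
    words = re.findall(r"[a-z']+", response.lower())
    return {
        # sensory density: enough concrete sense-words in the response
        "sensory_density": len(SENSORY_WORDS.intersection(words)) >= 2,
        # NPC voice markers: quoted dialogue present
        "npc_voice": '"' in response,
        # forward momentum: hands the scene back to the player
        "forward_momentum": response.rstrip().endswith("?"),
        # length discipline: within a rough word budget
        "length": 60 <= len(words) <= 250,
    }

def auto_passes(response: str) -> bool:
    """A clean auto-pass means every dimension cleared its threshold."""
    return all(auto_score(response).values())
```

Heuristics like these are cheap and deterministic, which is why they pair well with the LLM judge: the scorer checks convention-following, the judge checks writing quality.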
Total cost for the full 8-model run, including all judge calls: ~$0.02. (Note: GPT-OSS-20B is a reasoning model. If you use it as a judge, set max_tokens=300 or it'll burn all its tokens on internal reasoning and return null content. Ask me how I know.)

Results
Highlight reel: same prompt, 8 different GMs

Prompt: The player's rogue, Sable, arrives at the Ashmarket at dusk.

Gemma 3 27B (winner):
MiniMax M2.5:
Qwen3-80B:
Nemotron Nano 30B:
Llama 3.3 70B:
NPC introduction: same character, different voices

Gemma 3 27B: A faint scent of cloves precedes her, clinging to the air.

What it means

Gemma 3 27B is the headline. A 27B model beat Hermes 405B and matched the larger Gemma 4 31B. It got the most clean auto-passes (4), and the judge gave it 4.5 on both NPC craft and GM craft, the only model to crack 4.5 on anything in the run. For local inference, this is interesting: if you have the VRAM for a 27B, the narration quality is competitive with models 15x its size.

Bigger isn't better for narration quality. Hermes 405B had 0 auto-FAILs. It was the most disciplined model in the run, but its writing was safe rather than vivid. 405B bought consistency, not voice. If you're running it locally for the compliance properties, great. If you want atmosphere, there are better options at a fraction of the weight.

Nemotron Nano 30B scored the highest atmosphere (4.5) in the whole run. Its scene-setting sentences were genuinely cinematic. NPC craft suffered (3.0) and dialogue felt thin, but as a pure scene-painter it outscored everything else. Interesting for a 30B nano model.

Auto scores and judge scores can tell different stories. MiniMax had 0 auto-passes but a 4.0 judge average. Its writing quality was high and the judge noticed, but it violated structural discipline rules (length, pacing beats). The auto-scorer catches whether a model follows GM conventions; the judge catches whether it can write. Both matter.

Qwen3-coder wrote acceptable narration. This surprised me more than the Gemma result.

The probe is open source
probe/ + full results (including response samples for each)

If you're curious about the broader project: it started as a Claude Code family D&D thing (r/ClaudeAI post) and grew from there. The local model findings and routing architecture are in this GitHub Discussion if you want the longer version. Happy to answer questions about the probe design, the local inference findings, or how the GM routing architecture works.
I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality
Reddit r/LocalLLaMA / 4/19/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The article describes an open-source, tool-using “agentic” tabletop GM project that can run across different LLM backends and game systems, originally built from a Claude Code skill and generalized afterward.
- The author tests eight LLMs with a custom "narrative quality probe" focused on generating the atmosphere and narration you'd want to play in, arguing that tool-call compliance alone is insufficient.
- Results suggest a ~27B model delivers better narrative quality than a much larger 405B model, indicating that higher parameter counts do not necessarily produce more playable storytelling.
- The author finds that reliable local inference for multi-step tool chaining is difficult below roughly “70B+ on 64GB+ RAM,” with smaller models (e.g., ~24B on a MacBook Air) drifting attention after several sequential tool calls.
- Practical takeaways include using stronger local hardware for agentic workflows or routing via services like OpenRouter, alongside documentation of prompt/routing changes that improve performance (e.g., reducing standing prompt by ~87%).