Guess Llama - A game for local Vision LLM

Reddit r/LocalLLaMA / 4/11/2026


Key Points

  • Guess Llama is a local “Guess Who?”-style game that combines a vision LLM backend (e.g., llama.cpp’s llama-server) with an image generator (stable-diffusion.cpp sd-server, or OpenRouter image models) to create theme-based character sets.
  • The workflow generates 24 character images per theme and assigns both the player and the bot a random character; the LLM then plays by asking and answering yes/no elimination questions based on visual input.
  • The developer notes that the vision LLM can both generate elimination-question logic from the images and answer the player’s questions using its own image, enabling the game loop without direct manual labeling.
  • Early results suggest stronger performance with Qwen3.5, while a smaller Gemma4 variant sometimes makes incorrect eliminations; the author also emphasizes the need for image variation to avoid repeated characters.
  • Response latency can be long when using reasoning/thinking, since image generation and multi-step reasoning (even with remote backends like OpenRouter) add significant delay.

I've been working on a project I call Guess Llama.

The concept is based on the old 'Guess Who?' game.

'Guess Llama' uses a vision LLM backend such as llama.cpp's llama-server to generate and play the game. It currently uses stable-diffusion.cpp's sd-server or OpenRouter.ai image-generation models to generate the images.

  1. You can enter any 'theme' for the game, or ask the bot to generate one, such as 'cat', 'llama', 'capybara', 'clown', 'space alien', etc.
  2. The bot suggests 8 items that go with the theme (for image variation).
  3. The image server then generates 24 character images, each combining the theme with 2 of the items.
  4. You and the bot are assigned a random character from that set.
  5. You and the bot ask each other yes/no questions until one of you narrows it down to one possible character and wins.
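The numbered steps above can be sketched as plain game-state logic. This is an illustrative sketch, not the project's actual code; the trait names and data shapes are assumptions. It also shows why 8 items are enough: 8 items taken 2 at a time give C(8,2) = 28 unique pairs, comfortably covering 24 distinct characters.

```python
import random
from itertools import combinations

def build_characters(items, count=24):
    """Steps 2-3: pair up theme items; 8 items give C(8,2) = 28
    unique pairs, enough for 24 distinct characters."""
    pairs = list(combinations(items, 2))
    assert len(pairs) >= count
    random.shuffle(pairs)
    # Each character is identified by its two visual traits.
    return [{"id": i, "traits": set(p)} for i, p in enumerate(pairs[:count])]

def eliminate(candidates, trait, answer_yes):
    """Step 5: a yes/no question about a trait filters the
    remaining candidate set."""
    return [c for c in candidates if (trait in c["traits"]) == answer_yes]

items = ["bandanna", "hat", "glasses", "scarf",
         "bowtie", "earring", "flower", "cape"]
chars = build_characters(items)
secret = random.choice(chars)          # step 4: assign a character

# Ask: "does your character wear a hat?"
remaining = eliminate(chars, "hat", "hat" in secret["traits"])
print(len(chars), "->", len(remaining), "candidates left")
```

In the real game the filtering is done by the vision model looking at the images rather than by explicit trait sets, but the win condition is the same: the candidate list shrinks to one.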

The LLM backend actually looks at the images when deciding elimination questions, and looks at its own image when answering the player's elimination question.
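Presenting the images to the model works along the lines of the OpenAI-compatible chat API that llama-server exposes: each image goes in as a base64 data-URI `image_url` part alongside a text instruction. A minimal payload-building sketch (the model name, prompt wording, and fake image bytes are placeholders, not the project's actual values):

```python
import base64, json

def image_part(png_bytes):
    """Encode one character image as an OpenAI-style image_url part."""
    b64 = base64.b64encode(png_bytes).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def elimination_prompt(images, question=None):
    """Show the bot every remaining candidate image and ask it to
    invent a yes/no elimination question; when answering the player's
    question, only the bot's own image would be sent instead."""
    text = question or ("These are the remaining candidates. "
                        "Ask one yes/no question that splits them well.")
    content = [{"type": "text", "text": text}]
    content += [image_part(img) for img in images]
    return {"model": "qwen-vl",          # placeholder model name
            "messages": [{"role": "user", "content": content}]}

payload = elimination_prompt([b"\x89PNG fake-bytes"] * 23)
print(json.dumps(payload)[:80])
```

The same payload shape works against OpenRouter, which is presumably why swapping backends is cheap.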

Qwen3.5 has been doing great at playing the game. I'm surprised I pulled a win for the example video without cheating. When Qwen3.5 asked me about my capybara's red bandanna I thought it was going to be over.

A smaller Gemma4 seemed to get a bit confused, though I didn't test it extensively. For example, one eliminated my character erroneously even though I answered its question correctly.

I've been using Z-Image-Turbo for local images. It's actually a benefit if the image model has difficulty making the same character twice. We want variation.

With thinking/reasoning it can take a long time for the bot to generate a response. Even using OpenRouter as a backend to speed up testing takes a while.

The context used is around 6.2K tokens when 23 512x512 images are presented to the bot.
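Back-of-envelope from the numbers above (the per-image figure is an upper bound, since the reported total also includes the text prompt):

```python
total_tokens = 6200   # reported context with 23 images presented
images = 23
per_image = total_tokens / images
print(f"~{per_image:.0f} tokens per 512x512 image at most")
```

So each 512x512 image costs the model on the order of a few hundred tokens of context.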

  • Only tested on llama-server & openrouter. Other backends like LMStudio should work.
  • Only tested on Linux. The github workflows say it should compile on MacOS & Windows.
  • Can potentially add other image backends. stable-diffusion.cpp & openrouter seemed like the easiest to implement.
  • You can use the supplied 'Cat' theme if you don't want to wait for image generation before testing.
  • Primarily tested with Qwen3.5, but any vision model that can take in an arbitrary number of images (23) should be able to play.
  • There's no prompt caching; the tokens are reprocessed on every request.

Using OpenRouter's black-forest-labs/flux.2-klein-4b to generate images currently costs about $0.017 per image, if you don't want to generate them locally. Roughly $0.41 per image set. If you play against OpenRouter's qwen/qwen3.5-122b-a10b then it can cost up to $0.02 per interaction. (Each round has multiple interactions: generating a question, eliminating the characters based on the answer, etc.)
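The cost arithmetic, spelled out (the interactions-per-round count is my assumption for illustration; the post only says "multiple"):

```python
image_cost = 0.017        # per flux.2-klein-4b image on OpenRouter
images_per_set = 24
per_interaction = 0.02    # upper bound per qwen3.5-122b call

set_cost = image_cost * images_per_set
print(f"image set: ${set_cost:.2f}")          # 24 * 0.017 = $0.41

interactions_per_round = 3  # ASSUMED: question + answer + elimination
round_cost = interactions_per_round * per_interaction
print(f"round upper bound: ~${round_cost:.2f}")
```

So a full game is dominated by the one-time image-set cost; the per-round LLM cost stays in the cents.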

This seemed like the lowest hanging fruit for a vision based LLM game.

submitted by /u/SM8085