I've been working on a project I call Guess Llama. The concept is based on the old 'Guess Who?' game: 'Guess Llama' uses a vision LLM backend such as llama.cpp's llama-server to generate and play the game. It currently uses stable-diffusion.cpp's sd-server or OpenRouter.ai image-generation models to generate the images.
The LLM backend actually looks at the images when deciding on elimination questions, and looks at its own character's image when answering the player's elimination questions. Qwen3.5 has been doing great at playing the game; I'm surprised I pulled off a win for the example video without cheating. When Qwen3.5 asked me about my capybara's red bandanna, I thought it was going to be over. A smaller Gemma4 seemed to get a bit confused, though I didn't test it extensively: for instance, it eliminated my character erroneously despite me answering its question correctly. I've been using Z-Image-Turbo for local images. It's actually a benefit if the image model has difficulty making the same character twice, since we want variation. With thinking/reasoning enabled, it can take a long time for the bot to generate a response; even using OpenRouter as a backend to speed up testing takes a while. Context usage is around 6.2K tokens when 23 512x512 images are presented to the bot.
This seemed like the lowest-hanging fruit for a vision-based LLM game.
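The loop the post describes — presenting all board images to the vision backend alongside a yes/no question — can be sketched roughly as follows. This is a minimal sketch assuming llama-server's OpenAI-compatible `/v1/chat/completions` endpoint with base64 `image_url` content parts (available when the server is started with a vision model and its multimodal projector); the model alias and image bytes here are placeholders, not the project's actual code.

```python
import base64


def build_question_payload(images, question, model="qwen-vl"):
    """Assemble an OpenAI-style chat payload that attaches each board
    image as a base64 data URL alongside the yes/no question text.
    POSTing this as JSON to llama-server's /v1/chat/completions would
    let the vision model see the board while answering."""
    content = [{"type": "text", "text": question}]
    for img_bytes in images:
        b64 = base64.b64encode(img_bytes).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {
        "model": model,  # hypothetical model alias
        "messages": [{"role": "user", "content": content}],
        "temperature": 0.2,
    }


# Build (but do not send) a payload for a two-image board.
payload = build_question_payload(
    [b"<png-bytes-1>", b"<png-bytes-2>"],
    "Does your character wear a red bandanna? Answer yes or no.",
)
print(len(payload["messages"][0]["content"]))  # 1 text part + 2 image parts
```

Sending every board image per turn is what drives the ~6.2K-token context the author mentions; the payload shape above is also why image resolution (512x512 here) matters for latency.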
Guess Llama - A game for local Vision LLM
Reddit r/LocalLLaMA / 4/11/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage
Key Points
- Guess Llama is a local “Guess Who?”-style game that combines a vision LLM backend (e.g., llama.cpp’s llama-server) with an image generator (stable-diffusion.cpp sd-server, or OpenRouter image models) to create theme-based character sets.
- The workflow generates 24 character images per theme and assigns both the player and the bot a random character; the LLM then plays by asking and answering yes/no elimination questions based on visual input.
- The developer notes that the vision LLM can both generate elimination-question logic from the images and answer the player’s questions using its own image, enabling the game loop without direct manual labeling.
- Early results suggest stronger performance with Qwen3.5, while a smaller Gemma4 variant sometimes makes incorrect eliminations; the author also emphasizes the need for image variation to avoid repeated characters.
- Response latency can be long when using reasoning/thinking, since image generation and multi-step reasoning (even with remote backends like OpenRouter) add significant delay.
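The elimination mechanic in the key points above — a yes/no answer shrinking the candidate set — can be sketched with a toy board. This uses explicit trait flags for illustration only; the actual game infers traits visually from the generated images, and all names and traits here are hypothetical.

```python
def eliminate(candidates, trait, answer):
    """Keep only characters consistent with a yes/no answer:
    if the answer is True ("yes"), the trait must hold; if False, it must not."""
    return [c for c in candidates if bool(c.get(trait)) == answer]


# Hypothetical 3-character board (the real game generates 24 per theme).
board = [
    {"name": "capybara", "bandanna": True, "glasses": False},
    {"name": "llama", "bandanna": False, "glasses": True},
    {"name": "alpaca", "bandanna": False, "glasses": False},
]

# Bot asks: "Does your character wear a bandanna?" Player answers yes.
remaining = eliminate(board, "bandanna", answer=True)
print([c["name"] for c in remaining])  # only the capybara survives
```

The Gemma4 failure mode the author reports maps onto this directly: eliminating a character whose trait actually matched the answer corresponds to flipping the comparison in `eliminate`, which knocks the correct character off the board.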

