
DoomVLM is now Open Source - VLM models playing Doom

Reddit r/LocalLLaMA / 3/12/2026


Key Points

  • DoomVLM is now open source under the MIT license, enabling vision-language models to play Doom via ViZDoom with no reinforcement learning or fine-tuning—pure vision inference.
  • The release adds two deathmatch modes (Benchmark and Arena) that support up to four agents on the same map, with a configurable UI for prompts, tools, sampling, and history.
  • It supports any OpenAI-compatible API (LM Studio, Ollama, vLLM, OpenRouter, OpenAI, Claude) and enables cross-model comparisons such as Qwen 3.5 0.8B vs GPT-4o.
  • Gameplay is recorded (GIF/MP4 with overlays) and logged in workspace/, with a live scoreboard in Jupyter and results downloadable as a ZIP; performance notes cite ~10s/step on a MacBook M1 Pro 16GB for 0.8B and ~0.5s on RunPod L40S, with GPU recommended for arena.
  • The project is a single Jupyter notebook and MIT-licensed, with quick-start steps provided to get up and running.

A couple days ago I posted a video of Qwen 3.5 0.8B playing Doom here (https://www.reddit.com/r/LocalLLaMA/comments/1rpq51l/) — it blew up way more than I expected, and a lot of people asked me to open source it. Here it is: https://github.com/Felliks/DoomVLM

Since then I've reworked things pretty heavily. The big addition is deathmatch — you can now pit up to 4 models against each other on the same map and see who wins.

Quick reminder how it works: the notebook takes a screenshot from ViZDoom, draws a numbered column grid on top, and sends it to a VLM via any OpenAI-compatible API. The model has two tools, shoot(column) and move(direction), called with tool_choice: "required". No RL, no fine-tuning, pure vision inference.
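A single decision step can be sketched roughly like this. This is not the notebook's actual code; the prompt text, parameter types, and direction names are assumptions for illustration, only shoot(column), move(direction), and tool_choice: "required" come from the post:

```python
import base64

# Tool schema for the two actions: shoot(column) and move(direction).
# The parameter types and the direction enum are illustrative guesses.
TOOLS = [
    {"type": "function", "function": {
        "name": "shoot",
        "description": "Fire at the given screen column.",
        "parameters": {"type": "object",
                       "properties": {"column": {"type": "integer"}},
                       "required": ["column"]}}},
    {"type": "function", "function": {
        "name": "move",
        "description": "Move in the given direction.",
        "parameters": {"type": "object",
                       "properties": {"direction": {
                           "type": "string",
                           "enum": ["forward", "back", "left", "right"]}},
                       "required": ["direction"]}}},
]

def build_request(png_bytes: bytes, model: str) -> dict:
    """One decision step: screenshot in, forced tool call out."""
    b64 = base64.b64encode(png_bytes).decode()
    return {
        "model": model,
        "tools": TOOLS,
        "tool_choice": "required",  # the model MUST pick shoot() or move()
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "Pick one action for this frame."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    }

req = build_request(b"\x89PNG\r\n", "qwen-3.5-0.8b")
```

Because the tool call is forced, the model can never "just chat" about the frame; every inference produces exactly one game action.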

What's new:

Two deathmatch modes. Benchmark: models take turns playing against bots under identical conditions, for a fair comparison. Arena: everyone plays in the same game simultaneously via multiprocessing, so whoever inferences faster gets more turns.

Up to 4 agents, each fully configurable right in the UI — system prompt, tool descriptions, sampling parameters, message history length, grid columns, etc. You can put 0.8B against 4B against 9B and see the difference. Or Qwen vs GPT-4o if you feel like it.

Works with any OpenAI-compatible API — LM Studio, Ollama, vLLM, OpenRouter, OpenAI, Claude. Just swap the URL and model in the settings.
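Switching providers really is just swapping two strings. A rough sketch, where the preset names and model tags are illustrative (the ports are the usual LM Studio and Ollama defaults, not something the repo guarantees):

```python
# Hypothetical endpoint presets. Only base_url and model change between
# providers; the game loop itself stays identical.
ENDPOINTS = {
    "lmstudio": {"base_url": "http://localhost:1234/v1",  "model": "qwen-3.5-0.8b"},
    "ollama":   {"base_url": "http://localhost:11434/v1", "model": "qwen3.5:0.8b"},
    "openai":   {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
}

def client_settings(provider: str, api_key: str = "not-needed-locally") -> dict:
    """Return the kwargs you would hand to any OpenAI-compatible client."""
    cfg = ENDPOINTS[provider]
    return {"base_url": cfg["base_url"], "api_key": api_key, "model": cfg["model"]}
```

Local servers typically ignore the API key, which is why a placeholder works for LM Studio and Ollama.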

Episode recording in GIF/MP4 with overlays — you can see HP, ammo, what the model decided, latency. Live scoreboard right in Jupyter. All results are saved to the workspace/ folder — logs, videos, screenshots. At the end you can download everything as a single ZIP.
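The end-of-run "download everything as a ZIP" step can be sketched with the stdlib alone. The file names and workspace layout below are made up for the demo; real runs populate workspace/ themselves:

```python
import tempfile
import zipfile
from pathlib import Path

def zip_workspace(workspace: Path, out: Path) -> Path:
    """Bundle everything under workspace/ (logs, videos, screenshots)
    into a single ZIP for download."""
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(workspace.rglob("*")):
            if f.is_file():
                # store paths relative to workspace/ so the archive unpacks flat
                zf.write(f, f.relative_to(workspace))
    return out

# Demo with a throwaway workspace directory.
ws = Path(tempfile.mkdtemp()) / "workspace"
(ws / "videos").mkdir(parents=True)
(ws / "videos" / "ep1.gif").write_bytes(b"GIF89a")
(ws / "run.log").write_text("step 1: shoot(12)\n")
archive = zip_workspace(ws, ws.parent / "results.zip")
```

Keeping the arcnames relative to workspace/ means the ZIP mirrors the folder structure you see in Jupyter.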

Performance: on my MacBook M1 Pro 16GB the 0.8B model takes ~10 seconds per step. Threw it on a RunPod L40S — 0.5 seconds. You need a GPU for proper arena gameplay.

Quick start: LM Studio → lms get qwen-3.5-0.8b → lms server start → pip install -r requirements.txt → jupyter lab doom_vlm.ipynb → Run All

The whole project is a single Jupyter notebook, MIT license.

On prompts and current state: I haven't found universal prompts that would let Qwen 3.5 consistently beat every scenario. General observation — the simpler and shorter the prompt, the better the results. The model starts to choke when you give it overly detailed instructions.

I haven't tested flagships like GPT-4o or Claude yet, though the interface supports them; you can run them straight from your local machine with no GPU, just plug in an API key. If anyone tries them, I'd love to see how they compare.

I've basically just finished polishing the tool itself and am only now starting to explore which combinations of models, prompts and settings work best where. So if anyone gives it a spin — share your findings: interesting prompts, surprising results with different models, settings that helped. Would love to build up some collective knowledge on which VLMs actually survive in Doom. Post your gameplay videos — they're in workspace/ after each run (GIF/MP4 if you enabled recording).

submitted by /u/MrFelliks