Built a controllable computer-use VLM harness for Civilization VI (voice & natural language strategy → UI actions)

Reddit r/LocalLLaMA / 3/31/2026


Key Points

  • The author built “civStation,” an open-source, controllable vision-language model (VLM) harness that plays Civilization VI by translating voice or natural-language strategy into concrete UI actions via mouse/keyboard.
  • The system is designed as a strategy-level loop—screen observation, strategy interpretation, action planning, execution—rather than a low-level “click replication” demo.
  • It supports human-in-the-loop override/guidance and mentions MCP/skill extensibility, enabling live interruption and modular capability expansion.
  • The project emphasizes shifting the interaction layer upward (intent expression and controllable delegation) and raises questions about the optimal boundary between strategy and execution, as well as robustness and latency tradeoffs.
  • The author positions civStation as a testbed for broader questions about whether this approach can generalize beyond games to desktop workflows, and provides the repository for experimentation.

I built civStation, an open-source, controllable computer-use stack / VLM harness for Civilization VI.

The goal was not just to make an agent play Civ6, but to build a loop where the model can observe the game screen, interpret high-level strategy, plan actions, execute them through mouse and keyboard, and be interrupted or guided live through human-in-the-loop (HitL) or MCP.

Instead of treating Civ6 as a low-level UI automation problem, I wanted to explore strategy-level control.

You can give inputs like:
“expand to the east”
“focus on economy this turn”
“aim for a science victory”

and the system translates that intent into actual in-game actions.

At a high level, the loop looks like this:

screen observation → strategy interpretation → action planning → execution → human override
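The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names, the fake observation, and the keyword-based "strategy interpretation" are all hypothetical stand-ins (a real harness would capture a screenshot and call a VLM), but the control flow shows the idea, including the human-override check before each action.

```python
import queue

def observe_screen():
    # Hypothetical stand-in: the real harness would capture a screenshot here.
    return {"turn": 12, "units": ["settler", "warrior"]}

def interpret_strategy(directive, observation):
    # Hypothetical stand-in for the VLM call that turns a directive plus a
    # screen observation into a high-level plan.
    if "east" in directive:
        return ["select settler", "move east", "found city"]
    return ["end turn"]

def plan_actions(plan):
    # Map each high-level step to a concrete UI action (illustrative only).
    return [{"action": "click", "target": step} for step in plan]

def run_turn(directive, override_queue):
    """One pass of the loop: observe -> interpret -> plan -> execute,
    checking for a human override before each action."""
    observation = observe_screen()
    plan = interpret_strategy(directive, observation)
    executed = []
    for action in plan_actions(plan):
        try:
            # A human directive on the queue preempts the current plan.
            new_directive = override_queue.get_nowait()
            return executed, new_directive
        except queue.Empty:
            pass
        executed.append(action)  # stand-in for real mouse/keyboard execution
    return executed, None
```

With an empty override queue, `run_turn("expand to the east", q)` runs the whole plan; if a new directive is queued mid-turn, execution stops and the directive is handed back to the loop.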

This felt more interesting than just replicating human clicks, because it shifts the interface upward — from direct execution to intent expression and controllable delegation.

Most computer-use demos focus on “watch the model click.”

I wanted something closer to a controllable runtime where you can operate at the level of strategy instead of raw UI interaction.

Another motivation was that a lot of game UX is still fundamentally shaped by mouse, keyboard, and controller constraints. That doesn’t just affect control schemes, but also the kinds of interactions we even imagine.

I wanted to test whether voice and natural language, combined with computer-use, could open a different interaction layer — where the player behaves more like a strategist giving directives rather than directly executing actions.

Right now the project includes live desktop observation, real UI interaction on the host machine, a runtime control interface, human-in-the-loop control, MCP/skill extensibility, and natural language or voice-driven control.
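The "MCP/skill extensibility" piece can be approximated with a small registry pattern: named skills are callables registered at runtime, so capabilities can be added without touching the core loop. This is a sketch of the general pattern, not civStation's actual API; the class and skill names are invented for illustration.

```python
class SkillRegistry:
    """Minimal sketch of runtime skill extensibility: skills are named
    callables that can be registered and invoked by the agent loop."""

    def __init__(self):
        self._skills = {}

    def register(self, name, fn):
        # New capabilities plug in here without changing the core loop.
        self._skills[name] = fn

    def invoke(self, name, **kwargs):
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        return self._skills[name](**kwargs)

# Hypothetical usage: register a skill, then let the agent invoke it by name.
registry = SkillRegistry()
registry.register("found_city", lambda tile: f"founding city at {tile}")
```

An MCP server plays a similar role at the protocol level: it advertises tools the model can call, and the harness dispatches those calls much like `invoke` does here.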

Some questions I’m exploring:

Where should the boundary be between strategy and execution?
How controllable can a computer-use agent be before the loop becomes too slow or brittle?
Does this approach make sense only for games, or also for broader desktop workflows?

Repo: https://github.com/NomaDamas/civStation.git

submitted by /u/Working_Original9624