civStation - a VLM system for playing Civilization VI via strategy-level natural language

Reddit r/LocalLLaMA / 3/31/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • civStation is an experimental “computer-use” VLM system that plays Civilization VI by translating strategy-level natural-language instructions (e.g., “focus on economy” or “aim for a science victory”) into concrete in-game actions.
  • The system uses a three-layer design—Strategy (intent/goal planning and decomposition), Action (VLM-based screen interpretation plus mouse/keyboard execution without a game API), and HITL (human-in-the-loop overrides for real-time control).
  • Rather than relying on a single action sequence, it plans one strategy and then generates multiple possible action sequences per task, typically requiring about 2–16 model calls.
  • Execution is implemented via sub-agents for bounded gameplay tasks (such as city management or unit control), and the project emphasizes shifting interaction from “action → intent” toward delegation and agent orchestration.
  • Key challenges highlighted include VLM perception errors, execution drift across multi-step play, and limited verification reliability, alongside latency/API-cost trade-offs from multi-step calling and fallback behaviors.
  • The project’s central goal is not only automated gameplay, but also improving the human–system interface by enabling strategy-level control in UI-only environments.
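As a rough illustration of the three-layer design above, the split between strategy planning, VLM-driven execution, and human oversight could be sketched as below. All class and function names here are invented for illustration; nothing is taken from the civStation codebase, and the model calls are stubbed out.

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    """A structured goal decomposed from a natural-language intent."""
    intent: str                       # e.g. "aim for a science victory"
    subtasks: list = field(default_factory=list)

class StrategyLayer:
    """Converts natural language into a structured goal with subtasks."""
    def plan(self, instruction: str) -> Goal:
        # A real system would call a model here; we stub the decomposition.
        return Goal(intent=instruction,
                    subtasks=["found_city", "build_campus", "research_tech"])

class ActionLayer:
    """Would screenshot the game, query a VLM, and drive mouse/keyboard."""
    def execute(self, subtask: str) -> str:
        return f"executed:{subtask}"  # stub for VLM-driven GUI execution

class HITLLayer:
    """Gives a human veto power over each subtask before it runs."""
    def approve(self, subtask: str) -> bool:
        return True                   # stub: auto-approve everything

def run(instruction: str) -> list:
    strategy, action, hitl = StrategyLayer(), ActionLayer(), HITLLayer()
    goal = strategy.plan(instruction)
    return [action.execute(t) for t in goal.subtasks if hitl.approve(t)]
```

The point of the layering is that the Strategy layer never touches the screen and the Action layer never sees the original instruction, so either can be swapped out independently.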
  • A computer-use VLM harness that plays Civilization VI via natural language commands
  • High-level intents such as “expand to the east”, “focus on economy”, or “aim for a science victory” are translated into actual in-game actions
  • 3-layer architecture separating strategy and execution (Strategy / Action / HITL)
    • Strategy Layer: converts natural language → structured goals, maintains long-term direction, performs task decomposition
    • Action Layer: screen-based (VLM) state interpretation + mouse/keyboard execution (no game API)
    • HITL Layer: enables real-time intervention, override, and controllable autonomy
  • One strategy → multiple action sequences, with ~2–16 model calls per task
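The “one strategy → multiple action sequences” step might be sketched as follows; this is an assumed implementation, with the model call replaced by a seeded stub, showing only how candidate generation stays inside the ~2–16 call budget mentioned above.

```python
import random

# Possible action vocabulary for the stub; purely illustrative.
ACTIONS = ["move_settler", "found_city", "build_monument", "set_research"]

def fake_model_call(strategy: str, seed: int) -> list:
    """Stand-in for a VLM/LLM call proposing one action sequence."""
    return random.Random(seed).sample(ACTIONS, k=3)

def plan_candidates(strategy: str, n_candidates: int = 4, max_calls: int = 16):
    """Generate several candidate sequences for one strategy,
    charging each model call against a fixed budget."""
    candidates, calls_used = [], 0
    for seed in range(n_candidates):
        if calls_used >= max_calls:
            break                     # stay inside the call budget
        candidates.append(fake_model_call(strategy, seed))
        calls_used += 1
    return candidates, calls_used
```

A real system would then score or verify the candidates before picking one to execute.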
  • Sub-agent based execution for bounded tasks (e.g., city management, unit control)
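Sub-agent execution for bounded tasks could be as simple as a dispatch table; the task names and handlers below are hypothetical, not the project's actual sub-agents.

```python
def city_manager(args: dict) -> str:
    """Sub-agent for the bounded task of city management."""
    return f"city: set production to {args['item']}"

def unit_controller(args: dict) -> str:
    """Sub-agent for the bounded task of unit control."""
    return f"unit: move {args['unit']} {args['direction']}"

# Registry mapping bounded gameplay tasks to their sub-agents.
SUB_AGENTS = {
    "city_management": city_manager,
    "unit_control": unit_controller,
}

def dispatch(task: str, **kwargs) -> str:
    agent = SUB_AGENTS.get(task)
    if agent is None:
        raise ValueError(f"no sub-agent registered for {task!r}")
    return agent(kwargs)
```

Keeping each sub-agent's scope bounded is what limits execution drift: a city-management agent cannot wander into unit control mid-task.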
  • Explores shifting the interface from “action” to “intent”, rather than relying on RL, imitation learning, or scripted approaches
  • Moves from direct manipulation to delegation and agent orchestration
  • Key technical challenges:
    • VLM perception errors
    • execution drift across multi-step play
    • lack of reliable verification
  • Multi-step execution introduces latency and API-cost trade-offs; on failure, the system degrades to fallback strategies
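One way to picture that trade-off: a step executor that retries with a fallback when verification fails, where each extra attempt buys reliability at the cost of another model call and more latency. This is an assumed sketch with the execution and verification stubbed (the stub deliberately fails on the first attempt).

```python
def execute_step(step: str, attempt: int) -> bool:
    """Stand-in for VLM-driven execution; pretend the first try misses."""
    return attempt >= 1

def verify(step: str) -> bool:
    """Stand-in for screen-based verification of the step's effect."""
    return True

def run_with_fallback(step: str, max_attempts: int = 2):
    """Retry a step up to max_attempts times, counting model calls."""
    calls = 0
    for attempt in range(max_attempts):
        calls += 1                    # each attempt costs a model call
        if execute_step(step, attempt) and verify(step):
            return True, calls
    return False, calls
```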
  • Not fully autonomous: supports human-in-the-loop for real-time strategy correction and control
  • Experimental system tackling agent control and verification in UI-only environments
  • Focus is not just gameplay, but elevating the human-system interface to the strategy level
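The human-in-the-loop control described above might boil down to a loop like the one below: before each planned action, the agent drains a command queue so a human can stop the run or override a step in real time. Names and the command protocol are invented for illustration.

```python
from queue import Empty, Queue

def hitl_step(planned: list, commands: Queue) -> list:
    """Execute planned actions, honoring human stop/override commands."""
    executed = []
    for action in planned:
        try:
            cmd = commands.get_nowait()
        except Empty:
            cmd = None                # no human input: proceed as planned
        if cmd == "stop":
            break                     # human halts the whole sequence
        if isinstance(cmd, tuple) and cmd[0] == "override":
            action = cmd[1]           # human swaps in a different action
        executed.append(action)
    return executed
```

Because the queue is checked between actions rather than between turns, the human can redirect strategy mid-sequence without waiting for the agent to finish.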

project link

submitted by /u/Working_Original9624