OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space

arXiv cs.CV / 4/27/2026

Key Points

  • The paper introduces OccDirector, a generative framework that creates 4D occupancy dynamics for autonomous driving simulation using only natural-language conditioning, avoiding rigid geometric inputs like explicit trajectories.
  • OccDirector is designed as a “scenario director,” translating language scripts into physically plausible voxel-based spatiotemporal behavior while bridging a gap between semantics and spatiotemporal structure.
  • The method uses a Spatio-Temporal MMDiT driven by a vision-language model (VLM), with a history-prefix anchoring strategy to keep multi-agent interactions consistent over long horizons.
  • The authors introduce OccInteract-85k, a new dataset annotated with multi-level language instructions (from static layouts to complex multi-agent behaviors), along with a VLM-based evaluation benchmark; the reported experiments show state-of-the-art generation quality and strong instruction following.
  • The work positions language-driven behavior orchestration as a shift from traditional appearance-focused synthesis toward coordinating sequential interactions in simulated worlds.

Abstract

Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a “scenario director”, OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions, ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.
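
The abstract does not spell out how 4D occupancy is stored or how history-prefix anchoring works, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes a sparse voxel representation (one dictionary per time step) and reads "history-prefix anchoring" as conditioning each newly generated chunk on the most recent frames. The names `rollout`, `toy_generator`, `prefix_len`, and `chunk_len` are hypothetical, and the toy generator is a stand-in for a learned model that ignores the instruction text.

```python
from typing import Callable, Dict, List, Tuple

Voxel = Tuple[int, int, int]   # (x, y, z) grid index
Frame = Dict[Voxel, str]       # sparse occupancy: occupied voxel -> semantic label
Sequence4D = List[Frame]       # time-ordered frames: 3D space + time = "4D"

# Hypothetical generator signature (an assumption, not from the paper):
# (instruction, prefix frames, number of new frames) -> new frames
ChunkGenerator = Callable[[str, Sequence4D, int], Sequence4D]


def rollout(instruction: str, history: Sequence4D, generate_chunk: ChunkGenerator,
            total_frames: int, prefix_len: int = 2, chunk_len: int = 4) -> Sequence4D:
    """Extend `history` to `total_frames`, anchoring each chunk on recent frames."""
    frames = [dict(f) for f in history]  # copy so the caller's history is untouched
    while len(frames) < total_frames:
        prefix = frames[-prefix_len:]    # the most recent frames act as the anchor
        n_new = min(chunk_len, total_frames - len(frames))
        new_frames = generate_chunk(instruction, prefix, n_new)
        frames.extend(new_frames[:n_new])
    return frames


def toy_generator(instruction: str, prefix: Sequence4D, n_new: int) -> Sequence4D:
    """Stand-in for a learned model: shifts every occupied voxel +1 in x per frame."""
    out, last = [], prefix[-1]
    for _ in range(n_new):
        last = {(x + 1, y, z): label for (x, y, z), label in last.items()}
        out.append(last)
    return out


history = [{(0, 0, 0): "car"}, {(1, 0, 0): "car"}]
seq = rollout("the car keeps driving straight", history, toy_generator, total_frames=8)
print(len(seq), sorted(seq[-1].items()))
```

In this sketch the observed history is copied into the output unchanged, and every later chunk continues from frames that are already fixed, which is one simple way to keep a long sequence consistent with its past.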