Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving

arXiv cs.RO / 4/3/2026


Key Points

  • The paper proposes Causal Scene Narration (CSN), which restructures the text inputs of Vision-Language-Action (VLA) driving models at inference time through intent-constraint alignment, quantitative grounding, and structured separation, so the model no longer has to work out on its own which environmental constraints matter for the current maneuver.
  • CSN itself runs at inference time with zero GPU cost. The paper pairs it with two separate components: Simplex-based runtime safety supervision, and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization.
  • In a multi-town closed-loop CARLA evaluation, CSN improves Driving Score by +31.1% on the original LMDrive and by +24.5% on the preference-aligned variant.
  • A controlled ablation attributes 39.1% of the gain to causal structure and the remainder to information content alone; a separate perception-noise ablation shows that the benefit holds under realistic sensing errors.
  • Semantic safety supervision improves Infraction Score, whereas reactive Time-To-Collision (TTC) monitoring degrades performance, which the authors take as evidence that VLA driving systems need intent-aware monitoring.
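
To make the three CSN ingredients named above concrete, here is a minimal Python sketch of how a narration might be assembled from an intent and a list of scene facts. It is an illustration under stated assumptions, not the paper's implementation: the `Constraint` fields, the `narrate` function, the section labels, and the sample scene are all hypothetical, and the summary does not specify the authors' actual prompt template.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    """One environmental fact. All field names are hypothetical."""
    description: str     # e.g. "pedestrian crossing"
    distance_m: float    # quantitative grounding: distance ahead in meters
    affects: frozenset   # maneuvers this fact is relevant to


def narrate(intent: str, constraints: list[Constraint]) -> str:
    """Build a prompt that ties the maneuver to the constraints that affect it."""
    # Intent-constraint alignment: keep only facts relevant to this maneuver,
    # nearest first.
    relevant = sorted(
        (c for c in constraints if intent in c.affects),
        key=lambda c: c.distance_m,
    )
    other = [c for c in constraints if intent not in c.affects]

    # Structured separation: intent, relevant constraints, and background
    # each get their own labeled section.
    lines = [f"INTENT: {intent}", "RELEVANT CONSTRAINTS:"]
    lines += [f"  - {c.description} at {c.distance_m:.1f} m" for c in relevant]
    if other:
        lines.append("BACKGROUND:")
        lines += [f"  - {c.description} at {c.distance_m:.1f} m" for c in other]
    return "\n".join(lines)


# Made-up scene, for illustration only.
scene = [
    Constraint("red traffic light", 32.04, frozenset({"go straight", "turn left"})),
    Constraint("parked truck on right shoulder", 12.5, frozenset({"turn right"})),
    Constraint("pedestrian crossing", 8.0, frozenset({"turn left"})),
]
print(narrate("turn left", scene))
```

Because the sketch is plain string manipulation over existing perception outputs, it needs no model call, which is consistent with the zero-GPU-cost claim, although the paper's actual pipeline may differ.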

Abstract

Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.
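
The abstract names Plackett-Luce DPO with NLL regularization but does not give the loss. The sketch below shows one plausible reading in plain Python: the standard Plackett-Luce ranking likelihood applied to DPO-style implicit rewards, plus an NLL term on the top-ranked response. The function name, the `beta` and `nll_weight` parameters, and the choice to regularize only the best response are assumptions, not details confirmed by the source.

```python
import math


def plackett_luce_dpo_loss(policy_logps, ref_logps, beta=0.1, nll_weight=0.0):
    """Loss for one prompt whose candidate responses are ordered best-first.

    policy_logps[k] and ref_logps[k] are the summed token log-probabilities of
    the k-th ranked response under the trained policy and the frozen reference.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    rewards = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

    # Plackett-Luce log-likelihood of the ranking: at each position k, the
    # k-th response must win a softmax over itself and all lower-ranked ones.
    log_lik = 0.0
    for k in range(len(rewards)):
        tail = rewards[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(r - m) for r in tail))
        log_lik += rewards[k] - log_z

    # Assumed form of the NLL regularizer: keep the top-ranked response likely.
    nll = -policy_logps[0]
    return -log_lik + nll_weight * nll
```

With exactly two responses and `nll_weight=0`, this reduces to the familiar pairwise DPO loss, the negative log-sigmoid of the reward margin between the preferred and rejected response.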