Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
arXiv cs.AI / 4/27/2026
Key Points
- The paper studies how to keep planning-stage answers consistent with the model's own earlier perception answers in hierarchical driving graph visual question answering (GVQA), using cross-stage context passing on DriveLM-nuScenes.
- An explicit, training-free approach compares three prompt-based conditioning strategies on a domain-adapted 4B-parameter VLM; the best cuts NLI-measured contradictions by up to 42.6% and serves as a strong baseline.
- An implicit approach adds learned gated context projectors that transfer a hidden-state representation from one stage to the next, trained with stage-specific QLoRA adapters while updating only ~0.5% of parameters.
- The implicit method yields statistically significant improvements, including a 34% reduction in planning-stage NLI contradiction, a 50% increase in cross-stage entailment, and a +30.3% CIDEr gain in planning-language quality.
- The authors note a limitation of the implicit setup: it starts from non-driving-domain pretraining, which hurts lexical and structural consistency. They conclude that combining both strategies with better domain adaptation is a promising next step.
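The gated context projector in the implicit approach can be sketched roughly as follows. This is an illustrative NumPy toy, not the paper's architecture: the module names, the single-layer projection, and the sigmoid residual gate are all assumptions. The idea it demonstrates is that a hidden-state summary from an earlier stage (e.g. perception) is projected into the next stage's (e.g. planning) representation space, and a learned gate decides how much of that context to inject.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedContextProjector:
    """Toy sketch of a learned gated context projector (illustrative only).

    prev_hidden: hidden-state summary from the earlier QA stage
    cur_hidden:  hidden state of the current stage before context injection
    """
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        # Projection from the earlier stage's space into the current stage's space.
        self.W_proj = rng.normal(0.0, scale, (d_model, d_model))
        # Gate conditioned on both the current state and the projected context.
        self.W_gate = rng.normal(0.0, scale, (2 * d_model, d_model))

    def __call__(self, prev_hidden, cur_hidden):
        ctx = prev_hidden @ self.W_proj                       # project earlier-stage state
        gate_in = np.concatenate([cur_hidden, ctx], axis=-1)  # gate sees both signals
        g = sigmoid(gate_in @ self.W_gate)                    # per-dimension gate in (0, 1)
        return cur_hidden + g * ctx                           # gated residual injection
```

In a real system the projector's weights would be trained jointly with the stage-specific QLoRA adapters, so only the adapter and projector parameters (~0.5% of the model, per the paper) are updated.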

