From Scene to Object: Text-Guided Dual-Gaze Prediction
arXiv cs.CV / 4/23/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that existing driver attention datasets lack object-level gaze annotations, which makes text-grounded cognitive modeling difficult and can cause text-vision decoupling and visual-bias hallucinations in VLMs.
- To address this, it introduces G-W3DA, an object-level driver attention dataset built by pairing a multimodal large language model with SAM3 to convert scene-level heatmaps into object-level masks, with rigorous cross-validation to reduce annotation hallucinations (the conversion step is sketched after this list).
- It also proposes DualGaze-VLM, a dual-branch architecture that uses semantic-query hidden states and a Condition-Aware SE-Gate to dynamically modulate visual features for intent-driven, spatially anchored predictions (see the gating sketch below).
- Experiments on the W3DA benchmark show that DualGaze-VLM improves spatial alignment over the prior state of the art, with up to a 17.8% gain in Similarity (SIM) in safety-critical scenarios (the SIM metric is defined below).
- In a “visual Turing test”, 88.22% of human evaluators judged the generated attention heatmaps authentic, suggesting the model can produce reasonable cognitive priors.
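How the heatmap-to-mask conversion might work in practice: given candidate object masks from a segmenter such as SAM3, each mask can be scored by the share of scene-level attention mass it captures, keeping only the salient objects. The sketch below is a minimal illustration of that idea, not the paper's published pipeline; the function name and the `attention_threshold` value are assumptions.

```python
import numpy as np

def select_attended_objects(heatmap, masks, attention_threshold=0.1):
    """Score candidate object masks by the share of scene-level
    attention mass each captures, and keep the salient ones.

    heatmap: (H, W) non-negative scene-level attention map
    masks:   list of (H, W) boolean object masks (e.g. from a segmenter)
    """
    total = heatmap.sum()
    if total == 0:
        return []
    selected = []
    for i, mask in enumerate(masks):
        share = heatmap[mask].sum() / total  # attention mass inside this mask
        if share >= attention_threshold:
            selected.append((i, share))
    # Highest-attention objects first
    return sorted(selected, key=lambda t: t[1], reverse=True)
```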
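The summary describes the Condition-Aware SE-Gate only at a high level. One plausible reading, assuming a squeeze-and-excitation-style channel gate whose excitation is conditioned on a text embedding, is sketched below; the class name, dimensions, and layer layout are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConditionAwareSEGate(nn.Module):
    """Illustrative SE-style channel gate conditioned on a text embedding."""

    def __init__(self, vis_channels: int, text_dim: int, reduction: int = 4):
        super().__init__()
        hidden = vis_channels // reduction
        self.excite = nn.Sequential(
            nn.Linear(vis_channels + text_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, vis_channels),
            nn.Sigmoid(),  # per-channel gates in (0, 1)
        )

    def forward(self, vis: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual features; text: (B, D) condition vector
        squeezed = vis.mean(dim=(2, 3))  # squeeze: global average pool -> (B, C)
        gates = self.excite(torch.cat([squeezed, text], dim=-1))  # (B, C)
        return vis * gates[:, :, None, None]  # modulate channels by the condition
```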
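For reference, SIM (Similarity) is the standard histogram-intersection metric from the saliency literature: both maps are normalized to sum to one, and the score is the total overlapping mass, so 1 means identical distributions and 0 means disjoint support.

```python
import numpy as np

def similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """SIM / histogram intersection between two saliency maps, in [0, 1]."""
    p = pred / pred.sum()  # normalize to probability distributions
    q = gt / gt.sum()
    return float(np.minimum(p, q).sum())  # overlapping mass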