Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

arXiv cs.CV / 4/23/2026

Key Points

The paper introduces Object Referring-guided Scanpath Prediction (ORSP), which predicts human visual attention scanpaths for a target object specified by a referring expression.
It proposes ScanVLA, a model that uses a vision-language model (VLM) to extract and fuse aligned visual and linguistic representations from the input image and the referring expression.
To improve fine-grained positional accuracy, the work adds a History Enhanced Scanpath Decoder (HESD) that leverages past fixation positions when predicting the next fixation.
The approach further incorporates a frozen Segmentation LoRA as an auxiliary module to localize the referred object more precisely while avoiding significant extra compute or time costs.
Experiments show ScanVLA substantially outperforms prior scanpath prediction methods in the object-referring setting.

Abstract

Object Referring-guided Scanpath Prediction (ORSP) aims to predict the scanpath of human attention as a person searches a visual scene for a specific target object identified by a linguistic description. Multimodal information fusion is central to ORSP. We therefore propose a novel model, ScanVLA, which first uses a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and the referring expression. Next, to enhance ScanVLA's perception of fine-grained positional information, we introduce two components. The first is a novel History Enhanced Scanpath Decoder (HESD) that takes the position information of historical fixations directly as input to help predict a more reasonable position for the current fixation. The second is a frozen Segmentation LoRA, adopted as an auxiliary component to localize the referred object more precisely, which improves scanpath prediction without incurring large additional computational or time costs. Extensive experimental results demonstrate that ScanVLA significantly outperforms existing scanpath prediction methods in the object-referring setting.
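
The history-conditioned decoding described above can be illustrated with a minimal sketch. This is not the paper's HESD: the names `toy_next_fixation` and `rollout_scanpath`, the step size, and the stopping radius are all hypothetical, and the hand-written update rule stands in for a learned decoder. It only shows the interface the summary describes, in which each new fixation is predicted from the positions of the fixations that came before it.

```python
from typing import List, Tuple

Fixation = Tuple[float, float]  # (x, y) in pixel coordinates


def toy_next_fixation(history: List[Fixation],
                      target_center: Fixation,
                      step: float = 0.5) -> Fixation:
    """Hypothetical stand-in for one learned decoder step.

    It receives the full fixation history, as a history-aware decoder
    would, but this toy rule only reads the most recent fixation and
    moves a fixed fraction of the way toward the target center.
    """
    last_x, last_y = history[-1]
    tx, ty = target_center
    return (last_x + step * (tx - last_x), last_y + step * (ty - last_y))


def rollout_scanpath(start: Fixation,
                     target_center: Fixation,
                     max_fixations: int = 6,
                     stop_radius: float = 5.0) -> List[Fixation]:
    """Autoregressively build a scanpath, feeding past fixations back in."""
    scanpath = [start]
    while len(scanpath) < max_fixations:
        nxt = toy_next_fixation(scanpath, target_center)
        scanpath.append(nxt)
        # Stop once the latest fixation lands close enough to the target.
        dx = nxt[0] - target_center[0]
        dy = nxt[1] - target_center[1]
        if (dx * dx + dy * dy) ** 0.5 <= stop_radius:
            break
    return scanpath


if __name__ == "__main__":
    for x, y in rollout_scanpath((0.0, 0.0), (64.0, 32.0)):
        print(f"({x:.1f}, {y:.1f})")
```

Starting from (0, 0) with a target centered at (64, 32), this toy rollout produces five fixations and stops at (60, 30), the first one within the assumed 5-pixel radius. In a learned model, the target location would come from the fused image and text features rather than being passed in directly.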