GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding
arXiv cs.AI / 3/30/2026
Key Points
- The paper introduces GazeQwen, a lightweight, parameter-efficient method that lets multimodal LLMs exploit eye-gaze information for streaming video understanding, a cue that prior models have struggled to incorporate effectively.
- GazeQwen uses a compact gaze resampler (about 1–5M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-based positional encodings and produces additive residuals, which are injected into selected LLM decoder layers via forward hooks (see the first sketch after this list).
- An optional second training stage further improves gaze integration by adding LoRA modules to the underlying open-source MLLM (see the second sketch at the end of this note).
- On the StreamGaze benchmark (all 10 tasks), GazeQwen achieves 63.9% accuracy, outperforming the same Qwen2.5-VL-7B backbone with gaze treated as visual prompts (+16.1 points) and surpassing GPT-4o among tested models (+10.5 points).
- The results indicate that learning where to inject gaze inside an LLM can be more effective than simply increasing model size or refining prompt engineering.
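
To make the injection mechanism concrete, here is a minimal PyTorch sketch of the general idea: a small resampler maps gaze-conditioned video features to per-layer residuals, and forward hooks add those residuals to the hidden states of chosen decoder layers. The module shapes, layer choices, and hook details are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only -- architecture details are assumptions, not the authors' code.
import torch
import torch.nn as nn

class GazeResampler(nn.Module):
    """Maps video features + fixation positional encodings to one residual per injection layer."""
    def __init__(self, feat_dim: int, hidden_dim: int, n_inject_layers: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        # One lightweight head per decoder layer that receives a residual.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(n_inject_layers)
        )

    def forward(self, video_feats: torch.Tensor, fixation_pe: torch.Tensor):
        # video_feats, fixation_pe: (batch, tokens, feat_dim)
        h = self.proj(video_feats + fixation_pe)
        return [head(h) for head in self.heads]  # list of (batch, tokens, hidden_dim)

def register_gaze_hooks(decoder_layers, residuals, inject_ids):
    """Attach forward hooks that add the i-th residual to the output of each chosen layer."""
    handles = []
    for i, layer_id in enumerate(inject_ids):
        def hook(module, inputs, output, res=residuals[i]):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + res[:, : hidden.size(1), :]  # add gaze residual to hidden states
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        handles.append(decoder_layers[layer_id].register_forward_hook(hook))
    return handles  # call handle.remove() on each after the forward pass
```

Keeping the backbone frozen and training only the resampler is what keeps the trainable-parameter count in the 1–5M range reported above.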


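For the optional second stage, a plausible setup is to wrap the backbone's attention projections with LoRA adapters via the `peft` library while the base weights stay frozen. The rank, target modules, and loading class below are assumptions for illustration; the paper's exact recipe may differ.

```python
# Hypothetical stage-2 setup: LoRA adapters on the backbone MLLM (assumed hyperparameters).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# The actual Qwen2.5-VL backbone is loaded through its multimodal model class;
# AutoModelForCausalLM is used here only to keep the sketch generic.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                      # rank: assumed, not reported here
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters (plus the gaze resampler) are trained
```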