Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
arXiv cs.CV / 3/16/2026
📰 News · Tools & Practical Usage · Models & Research
Key Points
- Introduces EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks.
- Proposes Hand Intent Tokens (HINT), derived from 3D hand keypoints and interleaved with the model's input tokens to provide explicit spatial and temporal context for interpreting pointing intent (see the sketch after this list).
- Demonstrates that HINT improves performance across backbones and model sizes, with HINT-14B achieving 68.1% accuracy on average over 6 tasks, surpassing the state-of-the-art InternVL3-14B by 6.6%.
- The authors will release the code, model, and dataset to the research community, with a project page at https://yuuraa.github.io/papers/choi2026egovqa.
- Addresses gaps in gesture-rich data for egocentric AI assistants and advances gesture-based VQA by enabling more accurate understanding of pointing gestures.
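
The summary above only states that hand-keypoint-derived tokens are interleaved with the model input; the paper's actual HINT architecture is not detailed here. The sketch below is a minimal, hypothetical illustration of that general idea: projecting per-frame 3D hand keypoints into embedding-space tokens and appending one such token after each frame's visual tokens. All names (`HandIntentTokenizer`, `interleave_tokens`) and dimensions are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only; hypothetical module names and dimensions.
import torch
import torch.nn as nn

class HandIntentTokenizer(nn.Module):
    """Projects per-frame 3D hand keypoints into one intent token per frame."""
    def __init__(self, num_keypoints: int = 21, d_model: int = 1024):
        super().__init__()
        # Each keypoint has (x, y, z); flatten and project to the model width.
        self.proj = nn.Sequential(
            nn.Linear(num_keypoints * 3, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        # keypoints: (T, num_keypoints, 3) -> (T, 1, d_model)
        T = keypoints.shape[0]
        return self.proj(keypoints.reshape(T, -1)).unsqueeze(1)

def interleave_tokens(frame_tokens: torch.Tensor, hint_tokens: torch.Tensor) -> torch.Tensor:
    """Append the per-frame intent token after that frame's visual tokens,
    then flatten into a single sequence for the language model."""
    # frame_tokens: (T, n_patch, d), hint_tokens: (T, 1, d)
    return torch.cat([frame_tokens, hint_tokens], dim=1).flatten(0, 1)

# Example: 8 frames, 256 visual tokens per frame, 21 hand keypoints per frame
frames = torch.randn(8, 256, 1024)
kpts = torch.randn(8, 21, 3)
seq = interleave_tokens(frames, HandIntentTokenizer()(kpts))
print(seq.shape)  # torch.Size([2056, 1024]) -> 8 * (256 + 1)
```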
