HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models
arXiv cs.CV / 3/20/2026
Key Points
- HORNet is a lightweight frame-selection policy trained with Group Relative Policy Optimization (GRPO) to choose the frames a frozen vision-language model needs for reliable VQA performance.
- It reduces input frames by up to 99% and VLM processing time by up to 93% while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and temporal reasoning tasks (+7.3 points on NExT-QA).
- The method formalizes Select Any Frames (SAF) and generalizes better out-of-distribution than supervised or PPO baselines, with cross-model transfer yielding an additional 8.5% relative gain when paired with a stronger VLM.
- Evaluated on six benchmarks (341,877 QA pairs, 114.2 hours of video) and with publicly available code, HORNet demonstrates that choosing what the model sees is a practical complement to improving what it generates.
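
To make the GRPO idea in the first key point concrete, here is a minimal sketch of the group-relative advantage step applied to frame selection. It is not HORNet's implementation. The function names, the uniform frame weights, and the toy reward (1.0 if a chosen subset contains a designated "key frame", standing in for whether a frozen VLM answers correctly) are all assumptions made for illustration. The only part taken from GRPO itself is the normalization of each sample's reward against the mean and standard deviation of its own group.

```python
import random
import statistics


def sample_frames(weights, k, rng):
    """Draw k distinct frame indices, weighted by per-frame scores.

    `weights` stands in for a selection policy's output (hypothetical).
    """
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        w = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=w, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


def toy_reward(frames, key_frame):
    """Hypothetical reward: 1.0 if the subset contains the key frame.

    In a real setup this would come from scoring the frozen VLM's answer.
    """
    return 1.0 if key_frame in frames else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [1.0] * 16  # uniform policy over 16 frames (assumption)
    group = [sample_frames(weights, 2, rng) for _ in range(8)]
    rewards = [toy_reward(frames, key_frame=5) for frames in group]
    for frames, adv in zip(group, group_relative_advantages(rewards)):
        print(frames, round(adv, 3))
```

Subsets that beat their group's average reward get a positive advantage and the rest get a negative one, so no separate value network is needed. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.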