Adaptive Greedy Frame Selection for Long Video Understanding
arXiv cs.CL / 3/23/2026
Key Points
- The paper tackles inference bottlenecks in long-video understanding by proposing a question-adaptive greedy frame selection that balances query relevance and semantic representativeness under a fixed frame budget.
- It builds a 1 FPS candidate pool (capped at 1000) with exact timestamps and uses SigLIP for relevance and DINOv2 for semantic similarity to evaluate frames.
- Frames are selected by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term, yielding a normalized, monotone, submodular objective with a (1-1/e) approximation guarantee.
- It introduces four preset strategies and a lightweight text-only question-type classifier to route queries to the best-performing preset, enabling question-dependent trade-offs.
- Experiments on MLVU demonstrate consistent accuracy gains over uniform sampling and strong baselines, especially at tight frame budgets.
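The greedy selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes query-relevance scores (e.g. from SigLIP) and a pairwise frame-similarity matrix (e.g. DINOv2 cosine similarities) are precomputed, and the weight `alpha` trading off relevance against coverage is a hypothetical parameter name.

```python
import numpy as np

def greedy_frame_selection(rel, sim, budget, alpha=0.5):
    """Greedily pick `budget` frames maximizing a weighted sum of a
    modular relevance term and a facility-location coverage term.

    rel: (N,) query-relevance score per candidate frame (e.g. SigLIP).
    sim: (N, N) semantic similarity between frames (e.g. DINOv2 cosine).
    The coverage term sum_c max_{f in S} sim[c, f] is monotone and
    submodular, which is what gives greedy the (1 - 1/e) guarantee.
    """
    n = rel.shape[0]
    selected = []
    # best_cover[c] = max over selected f of sim[c, f]
    best_cover = np.zeros(n)
    for _ in range(min(budget, n)):
        # Marginal coverage gain of adding each candidate frame f:
        # sum_c max(0, sim[c, f] - best_cover[c])
        gain_cover = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        score = alpha * rel + (1.0 - alpha) * gain_cover
        score[selected] = -np.inf  # never re-pick a frame
        f = int(np.argmax(score))
        selected.append(f)
        best_cover = np.maximum(best_cover, sim[:, f])
    return selected
```

Setting `alpha=1.0` collapses this to pure relevance ranking, while `alpha=0.0` yields pure facility-location coverage; the paper's four presets and question-type router can be read as choosing such trade-off points per query.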