Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding
arXiv cs.CV / 4/2/2026
Key Points
- The paper addresses a key limitation of MLLMs for long-form video QA (limited context length and high compute cost) by focusing on efficient keyframe sampling.
- It proposes an evidence-driven sampling objective using information bottleneck theory, maximizing conditional mutual information between selected frames and the user query to better capture evidential clues.
- The method makes subset selection tractable by decomposing the optimization into independent frame-level scoring, avoiding inefficient combinatorial search.
- A query-conditioned evidence scoring network is introduced and trained with a contrastive objective to estimate each frame’s evidential importance efficiently.
- Experiments on long-form video understanding benchmarks show consistent improvements over prior sampling strategies under strict token budgets, along with better training efficiency.
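The decomposition described above (independent frame-level scoring followed by budgeted selection, instead of combinatorial subset search) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: plain cosine similarity stands in for the learned query-conditioned evidence scoring network, and `budget` stands in for the frame count allowed by the token budget.

```python
import numpy as np

def evidence_scores(frame_feats: np.ndarray, query_feat: np.ndarray) -> np.ndarray:
    """Score each frame independently against the query.

    Cosine similarity is a placeholder for the trained evidence scorer;
    the key property is that scoring is per-frame, so selection is O(T log T)
    rather than a combinatorial search over frame subsets.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    return f @ q

def select_keyframes(frame_feats: np.ndarray, query_feat: np.ndarray, budget: int) -> np.ndarray:
    """Pick the top-`budget` frames by evidence score, kept in temporal order."""
    scores = evidence_scores(frame_feats, query_feat)
    top = np.argsort(-scores)[:budget]  # highest-scoring frame indices
    return np.sort(top)  # restore temporal order before feeding the MLLM

# Toy usage: four one-hot "frame features"; the query overlaps frames 1 and 3.
frames = np.eye(4)
query = np.array([0.0, 1.0, 0.0, 1.0])
picked = select_keyframes(frames, query, budget=2)  # -> [1, 3]
```

In the paper's setting the scorer would be trained contrastively so that frames containing query-relevant evidence score higher than distractor frames; the selection step above is unchanged regardless of how the scores are produced.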