HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
arXiv cs.CV / 3/20/2026
Key Points
- HiMu is a training-free framework for long-form video question answering that uses a text-only LLM to decompose queries into a hierarchical logic tree with atomic predicates.
- Each predicate is routed to lightweight multimodal experts spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP) to generate modality-specific signals (a toy routing sketch follows the list below).
- The signals are normalized, temporally smoothed to align across modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency (see the composition sketch below).
- Evaluations show HiMu improves the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while using roughly 10x fewer FLOPs.
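The summary does not spell out HiMu's tree schema, fuzzy operators, or smoothing window, so the following is a minimal sketch under assumed definitions: AND as element-wise min, OR as element-wise max, a "before" operator implemented with a running maximum, and a simple moving-average smoother. The node classes (`Pred`, `And`, `Or`, `Before`), the `evaluate` function, and the example predicates are illustrative names, not the paper's API.

```python
import numpy as np

# ----- Illustrative logic-tree nodes (assumed, not the paper's schema) -----
class Pred:
    """Leaf: an atomic predicate, scored per frame by some multimodal expert."""
    def __init__(self, text):
        self.text = text

class And:
    def __init__(self, *children):
        self.children = children

class Or:
    def __init__(self, *children):
        self.children = children

class Before:
    """Temporal sequencing: the left branch should hold before the right branch."""
    def __init__(self, left, right):
        self.left, self.right = left, right

def normalize(s):
    """Min-max normalize a per-frame score vector into [0, 1]."""
    s = np.asarray(s, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

def smooth(s, window=5):
    """Moving-average smoothing so sparse signals (e.g. ASR hits) align with dense ones."""
    kernel = np.ones(window) / window
    return np.convolve(s, kernel, mode="same")

def evaluate(node, leaf_scores):
    """Compose per-frame scores bottom-up with fuzzy-logic operators.

    leaf_scores maps predicate text -> raw per-frame scores from its expert.
    Assumed semantics: AND = element-wise min, OR = element-wise max, and
    Before gates the right branch by the running max of the left branch.
    """
    if isinstance(node, Pred):
        return smooth(normalize(leaf_scores[node.text]))
    if isinstance(node, And):
        return np.minimum.reduce([evaluate(c, leaf_scores) for c in node.children])
    if isinstance(node, Or):
        return np.maximum.reduce([evaluate(c, leaf_scores) for c in node.children])
    if isinstance(node, Before):
        left = evaluate(node.left, leaf_scores)
        right = evaluate(node.right, leaf_scores)
        return np.minimum(np.maximum.accumulate(left), right)
    raise TypeError(f"unknown node type: {type(node).__name__}")

# Example question: "What does the chef say after chopping the onion?"
tree = Before(
    And(Pred("a chef is visible"), Pred("an onion is being chopped")),
    Pred("someone is speaking"),
)

num_frames = 64
rng = np.random.default_rng(0)  # random scores stand in for real expert outputs
leaf_scores = {
    "a chef is visible": rng.random(num_frames),
    "an onion is being chopped": rng.random(num_frames),
    "someone is speaking": rng.random(num_frames),
}

frame_scores = evaluate(tree, leaf_scores)
selected = np.argsort(frame_scores)[-16:]  # keep the 16 highest-scoring frames
print(sorted(selected.tolist()))
```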
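The routing of predicates to modality experts is likewise only named at a high level (CLIP, open-vocabulary detection, OCR, ASR, CLAP). The stub below shows how per-predicate, per-frame scores could be produced and fed into the composition sketch above; the keyword-based `route` function and the placeholder expert functions are assumptions standing in for the real models.

```python
import numpy as np

# Placeholder "experts": each maps (predicate text, number of frames) to one
# relevance score per frame. Real versions would wrap CLIP image-text similarity,
# an open-vocabulary detector, an OCR engine, ASR transcripts, or CLAP audio-text
# similarity; random scores are used here so the sketch runs standalone.
def vision_expert(pred, num_frames):
    return np.random.default_rng(0).random(num_frames)

def ocr_expert(pred, num_frames):
    return np.random.default_rng(1).random(num_frames)

def speech_expert(pred, num_frames):
    return np.random.default_rng(2).random(num_frames)

EXPERTS = {"vision": vision_expert, "text": ocr_expert, "speech": speech_expert}

def route(predicate):
    """Toy keyword router. In HiMu the text-only LLM presumably assigns the
    modality during decomposition; simple rules stand in for that here."""
    p = predicate.lower()
    if any(w in p for w in ("say", "speak", "mention", "sound", "music")):
        return "speech"
    if any(w in p for w in ("sign", "subtitle", "caption", "written", "text")):
        return "text"
    return "vision"

def score_predicates(predicates, num_frames):
    """Build the leaf_scores dict consumed by the composition sketch above."""
    return {p: EXPERTS[route(p)](p, num_frames) for p in predicates}

leaf_scores = score_predicates(
    ["a chef is visible", "an onion is being chopped", "someone is speaking"], 64
)
print({p: route(p) for p in leaf_scores})
```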