HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering
arXiv cs.CV · March 20, 2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- HiMu is a training-free framework for long-form video question answering that uses a text-only LLM to decompose queries into a hierarchical logic tree with atomic predicates.
- Each predicate is routed to lightweight multimodal experts spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP) to generate modality-specific signals.
- The signals are normalized, temporally smoothed to align across modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency.
- Evaluations show HiMu advances the efficiency-accuracy Pareto front: with Qwen3-VL 8B at a 16-frame budget it outperforms all competing frame selectors, and with GPT-4o it surpasses agentic systems that operate on 32-512 frames while using roughly 10x fewer FLOPs.
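The normalize-smooth-compose pipeline in the points above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the min/max fuzzy connectives, the min-max normalization, the moving-average smoother, and the running-max "before" operator are all assumed stand-ins for whatever HiMu actually uses. Two hypothetical atomic predicates are scored per frame by different experts (here labeled `clip_score` and `asr_score`), then composed to rank frames.

```python
import numpy as np

def minmax_norm(scores):
    """Scale per-predicate frame scores to [0, 1] (assumed normalization)."""
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def smooth(scores, window=3):
    """Moving-average temporal smoothing to align signals across modalities."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

# Fuzzy-logic connectives over per-frame truth values in [0, 1].
def fuzzy_and(a, b):
    return np.minimum(a, b)   # Goedel t-norm, one common choice

def fuzzy_or(a, b):
    return np.maximum(a, b)

def fuzzy_before(a, b):
    """Temporal sequencing: frame t satisfies (A before B) to the degree that
    B holds at t and A held at some earlier or equal frame (running max of A)."""
    return np.minimum(np.maximum.accumulate(a), b)

# Toy signals: two atomic predicates scored over 8 frames by different experts.
clip_score = minmax_norm(np.array([0.1, 0.9, 0.8, 0.2, 0.1, 0.1, 0.0, 0.0]))
asr_score  = minmax_norm(np.array([0.0, 0.0, 0.1, 0.2, 0.9, 0.8, 0.1, 0.0]))

# Bottom-up composition: "visual event, then spoken mention".
composed = fuzzy_before(smooth(clip_score), smooth(asr_score))

# Select the highest-scoring frames under a 2-frame budget.
top_frames = np.argsort(composed)[::-1][:2]
```

In this toy run the composed score peaks on the frames where the audio predicate fires after the visual one, so those frames win the budget; swapping the argument order of `fuzzy_before` would instead reward the reverse ordering.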