Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
arXiv cs.CV / 3/19/2026
Key Points
- Symphony is a cognitively-inspired multi-agent system for long-video understanding (LVU). It decomposes LVU into fine-grained subtasks and has its agents collaborate on deep reasoning that is enhanced by reflection.
- It introduces a VLM-based grounding approach that analyzes the LVU task and assesses the relevance of video segments, so that complex problems spanning long temporal ranges can be localized.
- The method aims to overcome limitations of simple task decomposition and embedding-based retrieval that risk losing key information in long contexts.
- Experiments report state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, including a 5.0% improvement on LVBench. The code is available on GitHub.
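
To make the idea of relevance-based segment selection concrete, here is a minimal sketch in Python. It is not Symphony's implementation: the `Segment` type, `keyword_scorer`, `select_segments`, and the `threshold` parameter are all hypothetical names chosen for illustration, and the toy keyword scorer merely stands in for whatever relevance judgment a VLM would provide. It only shows the general pattern of scoring each segment against a question and keeping the relevant ones in temporal order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A video segment; the caption is a hypothetical text stand-in for its content."""
    start_s: float
    end_s: float
    caption: str


def _words(text: str) -> set:
    """Lowercase the text and strip basic punctuation from each word."""
    cleaned = {w.lower().strip("?.,") for w in text.split()}
    cleaned.discard("")
    return cleaned


def keyword_scorer(question: str, segment: Segment) -> float:
    """Toy relevance score (assumption: replaces a real VLM call).

    Returns the fraction of question words that appear in the caption.
    """
    q_words = _words(question)
    if not q_words:
        return 0.0
    return len(q_words & _words(segment.caption)) / len(q_words)


def select_segments(
    question: str,
    segments: List[Segment],
    scorer: Callable[[str, Segment], float],
    threshold: float = 0.5,
) -> List[Segment]:
    """Keep segments scoring at or above the threshold, sorted by start time."""
    kept = [s for s in segments if scorer(question, s) >= threshold]
    return sorted(kept, key=lambda s: s.start_s)


if __name__ == "__main__":
    clips = [
        Segment(0.0, 60.0, "a chef chops onions"),
        Segment(60.0, 120.0, "a dog runs in the park"),
        Segment(120.0, 180.0, "the chef plates the dish"),
    ]
    for seg in select_segments("chef dish", clips, keyword_scorer):
        print(seg.start_s, seg.end_s, seg.caption)
```

Because the scorer is passed in as a function, the keyword heuristic could be swapped for a model-backed judgment without changing the selection logic.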




