Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

arXiv cs.CV / 5/4/2026


Key Points

  • The paper introduces MACF, an end-to-end multi-agent collaboration framework to improve multi-modal (vision-language) understanding on long videos despite limited per-model perception context budgets.
  • MACF splits videos into segments processed by locally budgeted agents, while a central coordinator enables global reasoning through an agent-native latent communication protocol.
  • Agents compress partial visual observations into compact, task-sufficient tokens within a shared embedding space to collaborate efficiently without relying on lossy textual intermediates.
  • The authors propose curriculum training that gradually strengthens semantic alignment, evidence summarization, and cross-agent coordination.
  • Experiments on multiple video understanding benchmarks indicate that MACF outperforms existing state-of-the-art MLLMs and multi-agent systems under the same budget constraints, suggesting scalable, information-preserving video understanding.
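The pipeline described above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's implementation: all function names are hypothetical, and mean-pooling stands in for each agent's learned encoder that compresses a segment into a fixed budget of latent tokens for the coordinator.

```python
# Hypothetical sketch of MACF-style latent collaboration (illustrative only):
# each agent compresses its video segment into a fixed number of latent tokens;
# the coordinator reasons over the concatenated token sequence.
from typing import List

def split_into_segments(frames: List[List[float]], num_agents: int) -> List[List[List[float]]]:
    """Partition frames evenly across agents (last agent takes the remainder)."""
    size = -(-len(frames) // num_agents)  # ceiling division
    return [frames[i:i + size] for i in range(0, len(frames), size)]

def compress_segment(segment: List[List[float]], budget: int) -> List[List[float]]:
    """Stand-in for an agent's encoder: mean-pool frame features into
    `budget` compact latent tokens (a real agent would learn this mapping)."""
    chunk = -(-len(segment) // budget)
    tokens = []
    for i in range(0, len(segment), chunk):
        group = segment[i:i + chunk]
        dim = len(group[0])
        tokens.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return tokens

def coordinate(latents: List[List[List[float]]]) -> List[List[float]]:
    """Coordinator input: all agents' tokens flattened into one shared sequence,
    so global context size depends on agent count, not video length."""
    return [tok for agent_tokens in latents for tok in agent_tokens]

# Toy run: 12 frames of 2-d features, 3 agents, 2 latent tokens per agent.
frames = [[float(i), float(i) * 0.5] for i in range(12)]
segments = split_into_segments(frames, num_agents=3)
latents = [compress_segment(seg, budget=2) for seg in segments]
context = coordinate(latents)
print(len(context))  # 3 agents x 2 tokens = 6, regardless of frame count
```

The point of the sketch is the budget decoupling: each agent's output is capped at `budget` tokens, so the coordinator's input grows with the number of agents rather than with the raw frame count.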

Abstract

Multi-modal large language models (MLLMs) advance vision-language understanding but face inherent limitations in long-video tasks due to bounded perception context budgets. Existing agentic methods mitigate this via rule-based preprocessing, yet often suffer from information loss, high cost, and reliance on textual intermediates. We propose MACF, an end-to-end Multi-Agent Collaboration Framework that decouples per-agent perception budgets from global video complexity, enabling scalable video understanding while preserving visual fidelity. MACF partitions videos into segments for locally budgeted agents and enables holistic reasoning via an agent-native latent communication protocol. Each agent encodes partial observations into compact, task-sufficient tokens in a shared embedding space, allowing efficient and information-preserving collaboration through a central coordinator. We introduce a curriculum training strategy that progressively enforces semantic alignment, evidence summarization, and cross-agent coordination. Extensive experiments on diverse video understanding benchmarks show that MACF consistently outperforms state-of-the-art MLLMs and multi-agent systems under identical budget constraints, demonstrating the effectiveness of our latent collaboration for scalable video understanding.
