Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

arXiv cs.CV / 3/20/2026

Key Points

  • Em-Garde decouples semantic understanding from streaming perception to improve efficiency in proactive video understanding.
  • At query time, the Instruction-Guided Proposal Parser converts user queries into structured, perceptually grounded visual proposals.
  • During streaming, a Lightweight Proposal Matching Module performs embedding-based matching to trigger responses with reduced computation.
  • Experiments on StreamingBench and OVO-Bench show consistent improvements in proactive response accuracy and efficiency over prior models.
  • The work demonstrates a practical solution for proactive video understanding under strict computational constraints.

Abstract

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions, which suffer from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
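The triggering mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, embedding shapes, and cosine-similarity threshold are all assumptions made for illustration, showing only the general idea of matching precomputed proposal embeddings against each incoming frame embedding at low per-frame cost.

```python
import numpy as np

def match_proposals(frame_emb, proposal_embs, threshold=0.8):
    """Hypothetical sketch of embedding-based proposal matching.

    frame_emb:     (d,) embedding of the current streamed frame
    proposal_embs: (num_proposals, d) embeddings of the query-time proposals
    Returns the indices of proposals whose cosine similarity with the
    frame exceeds the trigger threshold (i.e., candidate response triggers).
    """
    # Normalize so the dot product equals cosine similarity.
    frame = frame_emb / np.linalg.norm(frame_emb)
    props = proposal_embs / np.linalg.norm(proposal_embs, axis=1, keepdims=True)
    sims = props @ frame  # shape (num_proposals,)
    return np.flatnonzero(sims >= threshold)
```

Because the proposals are parsed once at query time, the per-frame cost during streaming reduces to a single matrix-vector product, which is the efficiency benefit the framework claims over per-frame LLM-based triggering.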