Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

arXiv cs.CV / 3/20/2026

Key Points

  • Em-Garde decouples semantic understanding from streaming perception to improve efficiency in proactive video understanding.
  • At query time, the Instruction-Guided Proposal Parser converts user queries into structured, perceptually grounded visual proposals.
  • During streaming, a Lightweight Proposal Matching Module performs embedding-based matching to trigger responses with reduced computation.
  • Experiments on StreamingBench and OVO-Bench show consistent improvements in proactive response accuracy and efficiency over prior models.
  • The work demonstrates a practical solution for proactive video understanding under strict computational constraints.

Abstract

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions, which suffer from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
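The triggering mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, embedding shapes, and cosine-similarity threshold are all assumptions made for illustration, showing only the general idea of matching precomputed proposal embeddings against each incoming frame embedding at low per-frame cost.

```python
import numpy as np

def match_proposals(frame_emb, proposal_embs, threshold=0.8):
    """Hypothetical sketch of embedding-based proposal matching.

    frame_emb:     (d,) embedding of the current streamed frame
    proposal_embs: (num_proposals, d) embeddings of the query-time proposals
    Returns the indices of proposals whose cosine similarity with the
    frame exceeds the trigger threshold (i.e., candidate response triggers).
    """
    # Normalize so the dot product equals cosine similarity.
    frame = frame_emb / np.linalg.norm(frame_emb)
    props = proposal_embs / np.linalg.norm(proposal_embs, axis=1, keepdims=True)
    sims = props @ frame  # shape (num_proposals,)
    return np.flatnonzero(sims >= threshold)
```

Because the proposals are parsed once at query time, the per-frame cost during streaming reduces to a single matrix-vector product, which is the efficiency benefit the framework claims over per-frame LLM-based triggering.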