Act2See: Emergent Active Visual Perception for Video Reasoning

arXiv cs.CV / 5/5/2026


Key Points

  • Vision-Language Models (VLMs) often use only static initial frames for video reasoning, which limits their ability to incorporate dynamic evidence as reasoning progresses.
  • The proposed Act-to-See (Act2See) framework lets VLMs actively interleave video frames into the text-based Chain-of-Thought (CoT), improving CoT quality and enabling visual synthesis for hypothetical or counterfactual scenarios.
  • Act2See is trained via supervised fine-tuning (SFT) on high-quality reasoning-trace data generated by a frontier VLM, where the traces include verified active frame retrieval or synthesis steps.
  • At inference time, the model dynamically decides when to retrieve existing frames or synthesize new ones to obtain the needed visual evidence (see the sketch after this list).
  • Experiments report new state-of-the-art performance on VideoEspresso and ViTIB, and improvements over comparable or larger models on several other video reasoning benchmarks.
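
As a rough illustration of the inference-time behavior described in the key points, the sketch below shows one way a VLM could interleave frame-retrieval and frame-synthesis calls into its chain of thought. The tag names (`<retrieve>`, `<synthesize>`, `<answer>`), the `vlm.generate` interface, and the `FrameToolbox` helpers are hypothetical placeholders for illustration, not the paper's actual API.

```python
import re
from typing import Any, List

# Hypothetical stand-ins for the video tools; Act2See's real interfaces
# are not specified in this summary.
class FrameToolbox:
    def retrieve(self, video: str, query: str) -> Any:
        """Return an existing frame from the video that matches the query (stub)."""
        ...

    def synthesize(self, description: str) -> Any:
        """Generate a new image for a hypothetical or counterfactual scene (stub)."""
        ...

def interleaved_cot(vlm, toolbox: FrameToolbox, video: str, question: str,
                    max_steps: int = 8) -> str:
    """Sketch of a loop in which the model emits tool-like tags inside its CoT
    and receives the requested frames as additional visual context."""
    context: List[Any] = [video, question]
    for _ in range(max_steps):
        step = vlm.generate(context)  # next CoT segment (assumed API)
        context.append(step)

        retrieve = re.search(r"<retrieve>(.*?)</retrieve>", step, re.S)
        synth = re.search(r"<synthesize>(.*?)</synthesize>", step, re.S)

        if retrieve:      # the model asks to look up an existing frame
            context.append(toolbox.retrieve(video, retrieve.group(1)))
        elif synth:       # the model asks to imagine a new frame
            context.append(toolbox.synthesize(synth.group(1)))
        elif "<answer>" in step:  # the model commits to a final answer
            return step
    return context[-1]
```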

Abstract

Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advance toward equipping VLMs with active visual perception for video reasoning.
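
The abstract also describes the training side: reasoning traces produced by a frontier VLM that interleave retrieval or synthesis calls and are kept only after verification against human-annotated CoTs. A single record in such an SFT dataset might look like the sketch below; every field name, the example content, and the `keep_trace` checker are illustrative assumptions rather than the paper's actual data schema.

```python
# Illustrative SFT record with an interleaved reasoning trace; all field
# names and contents are assumptions made for this sketch.
trace_example = {
    "video_id": "example_0001",
    "question": "What would have happened if the cyclist had not braked?",
    "reasoning_trace": [
        {"type": "text",
         "content": "The cyclist approaches the crossing at speed."},
        {"type": "retrieve_frame",
         "query": "the moment the cyclist starts braking"},
        {"type": "text",
         "content": "Braking begins roughly two meters before the curb."},
        {"type": "synthesize_frame",
         "description": "the same scene, but the cyclist never brakes"},
        {"type": "text",
         "content": "Without braking, the cyclist would reach the crosswalk while it is occupied."},
    ],
    "answer": "The cyclist would likely have entered the occupied crosswalk.",
}

def keep_trace(record: dict, agrees_with_human_cot) -> bool:
    """Keep a frontier-VLM trace only if it passes verification against a
    human-annotated CoT, per the abstract. `agrees_with_human_cot` is a
    hypothetical checker supplied by the data pipeline."""
    return bool(agrees_with_human_cot(record["reasoning_trace"], record["answer"]))
```

Filtering traces this way, rather than trusting the frontier model's outputs directly, is what the abstract credits for the dataset's quality, and SFT on the surviving traces is what yields the model's interleaved retrieve-or-synthesize behavior at inference time.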