Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

arXiv cs.CV / 4/17/2026


Key Points

  • The new Chain-of-Glimpse framework targets video understanding by grounding each step of multi-step reasoning in specific visual evidence regions rather than relying on object-agnostic cues.
  • It formulates video reasoning as an incremental, step-by-step process that builds spatially grounded traces around task-relevant objects, reducing over-reliance on saliency.
  • A search-guided controller is trained with reinforcement learning using a format reward that strongly encourages effective grounding and produces reliable reasoning trajectories.
  • Experiments on multiple benchmarks (NExTQA, Video-Holmes, CG-Bench Reasoning, VRBench) show consistent improvements, as well as robustness and better generalization across different video reasoning tasks.
  • The approach is designed to support compositional and interpretable multi-step decision-making for semantically discriminative objects across frames.
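The incremental, object-grounded loop described in the key points can be sketched as follows. This is an illustrative sketch only: the class and function names (`Glimpse`, `ReasoningTrace`, `propose_glimpse`, `answer_from`) are assumptions for exposition, not the paper's actual API. The controller alternates between proposing an evidence region (a "glimpse") and appending a grounded reasoning step until it decides it has enough evidence.

```python
from dataclasses import dataclass, field

@dataclass
class Glimpse:
    frame: int       # frame index the step attends to
    box: tuple       # normalized (x1, y1, x2, y2) evidence region
    rationale: str   # what this region contributes to the answer

@dataclass
class ReasoningTrace:
    steps: list = field(default_factory=list)

    def add(self, glimpse: Glimpse):
        self.steps.append(glimpse)

def reason_over_video(question, propose_glimpse, answer_from, max_steps=5):
    """Incrementally build a spatially grounded trace, then answer.

    `propose_glimpse(question, trace)` and `answer_from(question, trace)`
    stand in for the learned controller and answerer; both are hypothetical.
    `propose_glimpse` returns None once enough evidence has been gathered.
    """
    trace = ReasoningTrace()
    for _ in range(max_steps):
        glimpse = propose_glimpse(question, trace)
        if glimpse is None:
            break
        trace.add(glimpse)
    return answer_from(question, trace), trace
```

Because every step carries its own frame index and box, the resulting trace is both compositional and inspectable, which is the interpretability property the framework emphasizes.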

Abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional, multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.
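The format reward mentioned in the abstract could take a shape like the sketch below. Everything here is an assumption for illustration: the `<glimpse ...>` tag syntax, the reward values, and the trace layout are not the paper's specification. The idea is simply that a trajectory earns reward only when every reasoning step is well-formed and anchored to a concrete evidence region.

```python
import re

# Hypothetical trace format (an assumption, not the paper's):
#   <glimpse frame=12 box=0.10,0.20,0.55,0.80>the cup on the table</glimpse>
GLIMPSE_RE = re.compile(
    r"<glimpse frame=(\d+) box=([\d.]+),([\d.]+),([\d.]+),([\d.]+)>(.+?)</glimpse>"
)

def format_reward(trace: str) -> float:
    """Reward a trajectory only if every glimpse step is well-formed.

    Each step must name a frame and a valid normalized box (x1, y1, x2, y2)
    plus a non-empty rationale; any malformed step zeroes the reward,
    pushing the controller to ground every step in real evidence regions.
    """
    steps = GLIMPSE_RE.findall(trace)
    if not steps:
        return 0.0
    for _frame, x1, y1, x2, y2, rationale in steps:
        x1, y1, x2, y2 = map(float, (x1, y1, x2, y2))
        valid_box = 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0
        if not valid_box or not rationale.strip():
            return 0.0
    return 1.0
```

In a full RL setup this term would be combined with a task-accuracy reward; the all-or-nothing scoring here is just one simple way to make grounding a hard constraint on the trajectory format.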