VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

arXiv cs.CL / 3/23/2026



Key Points

VideoSeek introduces a long-horizon video agent that uses a think-act-observe loop and a toolkit to collect multi-granular observations, reducing the need to densely sample frames.
The approach leverages video logic flow to actively seek evidence for queries, maintaining or improving video understanding while using far fewer frames.
On four challenging benchmarks, VideoSeek achieves strong accuracy and outperforms the base model GPT-5 on LVBench by 10.2 absolute points while using 93% fewer frames.
The work underscores the importance of toolkit design and robust reasoning capabilities for practical video understanding and reasoning.
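
To make the "93% fewer frames" figure concrete, the arithmetic can be sketched as below. The baseline of 1,000 frames is a hypothetical number chosen for illustration; the summary above does not state how many frames the base model actually consumes.

```python
def frames_after_reduction(baseline_frames: int, percent_fewer: int) -> int:
    """Frames remaining after cutting `percent_fewer` percent of a baseline budget."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return baseline_frames * (100 - percent_fewer) // 100


# Hypothetical baseline of 1,000 frames (not a figure from the paper).
remaining = frames_after_reduction(1000, 93)
print(remaining)  # 70
```

Under that assumed baseline, a 93% reduction leaves 70 frames to inspect.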

Abstract

Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still rely heavily on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute-point improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
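
The think-act-observe loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the video is a hypothetical list of text captions, `sample_frames` is an invented stand-in for one tool in a toolkit, and the "think" step is a keyword-overlap stub where a real agent would query an LMM. It only shows how coarse-to-fine seeking over accumulated observations can locate evidence without reading every frame.

```python
def sample_frames(video, start, end, n):
    """Hypothetical tool: return {index: caption} for up to n evenly spaced frames in [start, end)."""
    span = end - start
    n = min(n, span)
    if n <= 0:
        return {}
    indices = [start + (i * span) // n for i in range(n)]
    return {idx: video[idx] for idx in indices}


def run_agent(video, query, frames_per_call=10, max_steps=8):
    """Toy think-act-observe loop. Returns (frame index or None, unique frames inspected)."""
    words = query.split()
    seen = {}  # accumulated observations: frame index -> caption

    def score(idx):
        # Stub relevance: number of query words present in the caption.
        return sum(w in seen[idx].split() for w in words)

    window = (0, len(video))  # initial thought: begin with a coarse overview
    for _ in range(max_steps):
        # Act: call the sampling tool on the chosen window.
        new_observations = sample_frames(video, window[0], window[1], frames_per_call)
        # Observe: merge the results into the accumulated observations.
        seen.update(new_observations)
        # Think: pick the most relevant frame seen so far.
        best = max(sorted(seen), key=score)
        if score(best) == len(words):
            return best, len(seen)  # all query words matched: evidence found
        if score(best) == 0:
            break  # nothing relevant observed; no lead to follow
        # Zoom into the neighborhood of the best frame at a finer granularity.
        stride = max(1, (window[1] - window[0]) // frames_per_call)
        window = (max(0, best - stride), min(len(video), best + stride))
    return None, len(seen)


# Toy 1,000-frame video with nested scenes (all captions are invented).
video = ["street"] * 400 + ["kitchen"] * 300 + ["garden"] * 300
video[500:560] = ["kitchen mug"] * 60
video[520:530] = ["kitchen red mug"] * 10

found, frames_used = run_agent(video, "kitchen red mug")
print(found, frames_used)  # 520 18
```

In this toy run the agent inspects 18 of 1,000 frames: one coarse pass finds the most promising region, and one finer pass inside it finds the target. The numbers depend entirely on the invented video and should not be read as reproducing the paper's results.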



