POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
arXiv cs.CV / 4/16/2026
Key Points
- The paper argues that large multimodal models (LMMs) are constrained by their static parametric knowledge and therefore need active multimodal search to retrieve evidence from the external world.
- It proposes building a multimodal agentic search model end-to-end rather than retrofitting an existing LMM with search as an add-on module.
- The authors introduce “Agentic Seeding” to create training conditions that elicit agent-like behaviors from the start.
- They identify a long-horizon interaction bottleneck where growing dialogue history makes it harder to find ground-truth evidence, and they mitigate it with “V-Fold,” an adaptive history-aware compression approach.
- They release “POINTS-Seeker-8B,” which they report as outperforming prior multimodal agentic search models across six benchmarks, specifically improving long-horizon, knowledge-intensive visual reasoning.
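The paper names V-Fold as an adaptive, history-aware compression approach but the summary above gives no implementation details. As a purely illustrative sketch (the function, turn format, character-based cost estimate, and truncation-stub strategy below are all assumptions, not the authors' method), generic history compression for a long-horizon agent might look like:

```python
# Hypothetical sketch of history-aware compression for a search agent.
# Assumptions (NOT from the paper): turns are dicts with "role" and
# "content", cost is approximated by character count, and older turns
# are folded into a single stub instead of model-generated summaries.

def compress_history(turns, budget_chars=200, keep_recent=2):
    """Fold older turns into one stub when history exceeds the budget."""
    total = sum(len(t["content"]) for t in turns)
    if total <= budget_chars or len(turns) <= keep_recent:
        return turns  # history still fits; leave it untouched
    recent = turns[-keep_recent:]          # always keep the latest turns
    folded = turns[:-keep_recent]          # older turns get folded away
    stub = {
        "role": "system",
        "content": "[folded %d earlier turns: %s]"
        % (len(folded), "; ".join(t["content"][:20] for t in folded)),
    }
    return [stub] + recent
```

A real system would replace the truncation stub with a learned or model-generated summary and measure cost in tokens rather than characters; the point of the sketch is only the shape of the trade-off the paper describes, trading older context for a bounded prompt.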