Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

arXiv cs.CV / 3/31/2026


Key Points

  • The paper proposes SSV-CoT, a method for multimodal LLMs that replaces static “visual prefix” encoding with a goal-driven, adaptive access pattern over image regions.
  • It generates a question-relevant saliency map to explicitly structure where visual attention should go, then performs reasoning in that discriminative order to create a curriculum-like progression from primary to secondary cues.
  • Training is end-to-end using text CoT and answer supervision, avoiding costly region-level annotations or specialized external tools.
  • Experiments across multiple visual reasoning benchmarks report improvements, supporting the claim that structured sequential visual cognition enhances performance.
  • The approach is motivated by human visual perception, treating attention shifts as a key mechanism for selecting informative visual information during reasoning.
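The ordering step described above can be sketched as follows. This is a toy illustration, not the paper's implementation: cosine similarity between a question embedding and region features stands in for the learned question-relevant saliency map, and all names (`saliency_ordered_regions`, the toy features) are illustrative.

```python
import numpy as np

def saliency_ordered_regions(question_emb, region_embs):
    """Rank image-region features by question-conditioned saliency,
    most relevant first. Cosine similarity is a stand-in for the
    learned saliency map; names here are illustrative, not SSV-CoT's."""
    q = question_emb / np.linalg.norm(question_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    saliency = r @ q
    # Descending saliency yields the primary-to-secondary reading order
    # that the sequential reasoning pass would then follow.
    return np.argsort(-saliency)

# Toy regions with near-one-hot features so the ranking is easy to check.
regions = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0]])
question = np.array([0.0, 0.9, 0.1])  # mostly "about" the second feature axis
order = saliency_ordered_regions(question, regions)
print(order)  # → [1 3 2 0]: region 1 is the primary cue, region 0 the last
```

In the actual method, this discriminative order would drive which regions the model attends to first during chain-of-thought generation, rather than presenting all regions at once as a static prefix.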

Abstract

Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention shifts selectively and sequentially from the most informative regions to secondary cues, we propose Structured Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning follows this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. The method is trained end-to-end with text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.