Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

arXiv cs.CV / 3/31/2026


Key Points

  • The paper proposes SSV-CoT, a method for multimodal LLMs that replaces static “visual prefix” encoding with a goal-driven, adaptive access pattern over image regions.
  • It generates a question-relevant saliency map to explicitly structure where visual attention should go, then performs reasoning in that discriminative order to create a curriculum-like progression from primary to secondary cues.
  • Training is end-to-end using text CoT and answer supervision, avoiding costly region-level annotations or specialized external tools.
  • Experiments across multiple visual reasoning benchmarks report improvements, supporting the claim that structured sequential visual cognition enhances performance.
  • The approach is motivated by human visual perception, treating attention shifts as a key mechanism for selecting informative visual information during reasoning.
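The ordering step described above can be sketched as follows. This is a toy illustration, not the paper's implementation: cosine similarity between a question embedding and region features stands in for the learned question-relevant saliency map, and all names (`saliency_ordered_regions`, the toy features) are illustrative.

```python
import numpy as np

def saliency_ordered_regions(question_emb, region_embs):
    """Rank image-region features by question-conditioned saliency,
    most relevant first. Cosine similarity is a stand-in for the
    learned saliency map; names here are illustrative, not SSV-CoT's."""
    q = question_emb / np.linalg.norm(question_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    saliency = r @ q
    # Descending saliency yields the primary-to-secondary reading order
    # that the sequential reasoning pass would then follow.
    return np.argsort(-saliency)

# Toy regions with near-one-hot features so the ranking is easy to check.
regions = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0]])
question = np.array([0.0, 0.9, 0.1])  # mostly "about" the second feature axis
order = saliency_ordered_regions(question, regions)
print(order)  # → [1 3 2 0]: region 1 is the primary cue, region 0 the last
```

In the actual method, this discriminative order would drive which regions the model attends to first during chain-of-thought generation, rather than presenting all regions at once as a static prefix.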

Abstract

Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention shifts selectively and sequentially from the most informative regions to secondary cues, we propose Structured Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning follows this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. The method is trained end-to-end with text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.