Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

arXiv cs.CL / 4/17/2026


Key Points

  • Visual reasoning models that combine vision and language can overthink by generating unnecessarily long reasoning chains even when shorter reasoning would suffice.
  • The paper attributes this problem to “Reasoning Path Redundancy” and proposes AVR, which splits visual reasoning into perception, logical reasoning, and answer application.
  • AVR lets a model dynamically pick among three response formats—Full, Perception-Only, or Direct Answer—to avoid irrelevant reasoning steps.
  • The approach is trained using FS-GRPO, adapted from Group Relative Policy Optimization, to favor the most efficient reasoning format while keeping correctness.
  • Experiments on several vision-language benchmarks show 50–90% token reduction with no loss in overall accuracy, especially for tasks that are perception-heavy.

Abstract

Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains even for tasks that do not require them. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50–90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.