Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

arXiv cs.CL / 4/17/2026


Key Points

  • Visual reasoning models that combine vision and language can overthink by generating unnecessarily long reasoning chains even when shorter reasoning would suffice.
  • The paper attributes this problem to “Reasoning Path Redundancy” and proposes AVR, which splits visual reasoning into perception, logical reasoning, and answer application.
  • AVR lets a model dynamically pick among three response formats—Full, Perception-Only, or Direct Answer—to avoid irrelevant reasoning steps.
  • The approach is trained using FS-GRPO, adapted from Group Relative Policy Optimization, to favor the most efficient reasoning format while keeping correctness.
  • Experiments on several vision-language benchmarks show 50–90% token reduction with no loss in overall accuracy, especially for tasks that are perception-heavy.

Abstract

Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains even for tasks that do not require them. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50–90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: https://github.com/RunRiotComeOn/AVR.