Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

arXiv cs.CL / 4/17/2026


Key Points

  • The paper analyzes “reasoning dynamics” in 18 vision-language models (VLMs) by tracking answer confidence during chain-of-thought (CoT) reasoning and measuring how effectively models correct earlier predictions (a minimal sketch of such a probe appears after this list).
  • It finds that many VLMs exhibit “answer inertia”: early commitments to an answer tend to persist, and are reinforced rather than revised, as reasoning proceeds.
  • Reasoning-trained models show more corrective behavior, but improvements vary strongly with modality conditions (text-dominant vs vision-only), indicating limits to robustness.
  • In controlled interventions with misleading textual cues, the study shows that models remain influenced by the text even when the visual evidence alone is sufficient, and that whether this reliance is recoverable from the CoT depends on both the model and the signal being monitored.
  • The authors conclude that CoT offers only a partial view of cross-modal decision-making, with direct implications for transparency and safety in multimodal systems.
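
To make the confidence-tracking probe concrete, here is a minimal Python sketch of per-step answer confidence and one possible answer-inertia check. The `answer_logprobs` hook, the thresholded notion of “commitment,” and all names below are illustrative assumptions, not the paper's actual code or metrics.

```python
import math

def answer_logprobs(prefix: str, choices: list[str]) -> dict[str, float]:
    """Hypothetical hook: return the model's log-probability of each
    candidate answer given a (possibly truncated) reasoning prefix."""
    raise NotImplementedError("wire up your VLM's answer scoring here")

def confidence_trajectory(question: str, cot_steps: list[str],
                          choices: list[str]) -> list[dict[str, float]]:
    """Probability assigned to each candidate answer after each cumulative
    reasoning step, normalized over the candidate set."""
    trajectory, prefix = [], question
    for step in cot_steps:                      # reveal the CoT step by step
        prefix = prefix + "\n" + step
        lps = answer_logprobs(prefix, choices)
        z = math.log(sum(math.exp(v) for v in lps.values()))  # log-normalizer
        trajectory.append({a: math.exp(v - z) for a, v in lps.items()})
    return trajectory

def shows_answer_inertia(trajectory: list[dict[str, float]],
                         threshold: float = 0.5) -> bool:
    """One illustrative operationalization: the first answer whose
    probability crosses `threshold` is also the final argmax answer,
    i.e. the early commitment was never displaced."""
    if not trajectory:
        return False
    first = next((max(p, key=p.get) for p in trajectory
                  if max(p.values()) > threshold), None)
    final = max(trajectory[-1], key=trajectory[-1].get)
    return first is not None and first == final
```

On this reading, inertia is simply an early above-threshold commitment that survives to the final step; the paper's own metric may be defined differently.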

Abstract

Recent advances in vision-language models (VLMs) offer reasoning capabilities, yet how these capabilities unfold, and how they integrate visual and textual information, remains unclear. We analyze reasoning dynamics in 18 VLMs, covering instruction-tuned and reasoning-trained models from two model families. We track confidence over the chain of thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced rather than revised during subsequent reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when the visual evidence is sufficient, and we assess whether this influence is recoverable from the CoT. Although the influence can surface in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to refer to the cues explicitly, but their longer, fluent CoTs can still appear visually grounded while actually following the textual cues, obscuring their modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
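
As one illustration of the intervention described above, the sketch below measures how often a misleading textual cue flips an answer the model already gets right from the image alone. The `ask` wrapper, the example fields, and the prompt layout are hypothetical placeholders, not the paper's protocol.

```python
def ask(image, text: str) -> str:
    """Hypothetical wrapper: query a VLM with an image plus a text prompt
    and return its final answer string."""
    raise NotImplementedError("wire up your VLM's generate/parse loop here")

def cue_flip_rate(examples: list[dict]) -> float:
    """Fraction of visually solvable questions whose answer flips once a
    misleading textual cue is prepended. Each example is assumed to carry
    an image, a question, the gold answer, and a misleading cue string."""
    flipped = eligible = 0
    for ex in examples:
        base = ask(ex["image"], ex["question"])
        if base != ex["gold"]:
            continue              # only count cases solved from the image
        eligible += 1
        cued = ask(ex["image"], ex["cue"] + "\n" + ex["question"])
        flipped += cued != ex["gold"]
    return flipped / max(eligible, 1)
```

Restricting the denominator to visually solvable cases isolates text reliance: any flip on those cases is attributable to the cue rather than to a failure of visual perception.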