When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

arXiv cs.CL / 4/9/2026


Key Points

  • The paper introduces the Graded Color Attribution (GCA) dataset as a controlled benchmark for studying when vision-language models (VLMs) will behave unexpectedly and whether they follow their own stated introspective rules.
  • In GCA, both humans and VLMs state a pixel-level threshold rule: the minimum percentage of an object's pixels that must be a given color for the object to receive that color label, evaluated across multiple recoloring conditions.
  • Results show humans remain largely faithful to their stated rules, and apparent human “violations” are attributed to overestimation of color coverage rather than rule-breaking.
  • In contrast, VLMs systematically contradict their own introspective rules even when they are strong estimators of color coverage, with GPT-5-mini violating its stated rules in nearly 60% of cases on objects with strong color priors (see the faithfulness-check sketch after this list).
  • The findings indicate that world-knowledge priors reduce introspection faithfulness for models in patterns unlike human cognition, suggesting VLM self-knowledge is miscalibrated and raising concerns for trustworthy, high-stakes deployment.
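
Concretely, the faithfulness check amounts to comparing each attribution decision against what the participant's own stated threshold would predict given the true color coverage. The sketch below illustrates that comparison under stated assumptions; the `Trial` fields and function names are illustrative and are not taken from the paper's code.

```python
# Minimal sketch of a rule-faithfulness check (illustrative names, not the authors' code).
from dataclasses import dataclass

@dataclass
class Trial:
    stated_threshold: float   # elicited rule: min. coverage (0-1) required to apply the label
    true_coverage: float      # ground-truth fraction of pixels in the target color
    attributed_label: bool    # did the participant/model actually apply the color label?

def is_violation(t: Trial) -> bool:
    """A trial violates the stated rule when the attribution decision
    disagrees with what the threshold rule predicts for the true coverage."""
    rule_prediction = t.true_coverage >= t.stated_threshold
    return rule_prediction != t.attributed_label

def violation_rate(trials: list[Trial]) -> float:
    """Fraction of trials on which the participant contradicts its own stated rule."""
    return sum(is_violation(t) for t in trials) / len(trials)

# Example: the model says "at least 50% of pixels must be red", the apple is 30% red,
# yet the model still calls it red -> one violation out of two trials.
trials = [
    Trial(stated_threshold=0.5, true_coverage=0.3, attributed_label=True),   # violation
    Trial(stated_threshold=0.5, true_coverage=0.7, attributed_label=True),   # faithful
]
print(violation_rate(trials))  # 0.5
```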

Abstract

Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.
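
For intuition, pixel-level color coverage of a recolored line drawing can be measured roughly as the fraction of non-background pixels close to the target color. The snippet below is a minimal sketch of that idea using NumPy and Pillow; the tolerance-based matching, the white-background assumption, and the file name are assumptions, not the authors' pipeline.

```python
# Illustrative coverage measurement for a recolored line drawing (not the paper's exact pipeline).
import numpy as np
from PIL import Image

def color_coverage(path: str, target_rgb: tuple[int, int, int],
                   tol: int = 30,
                   background_rgb: tuple[int, int, int] = (255, 255, 255)) -> float:
    """Fraction of the object's (non-background) pixels within `tol` of `target_rgb`."""
    img = np.asarray(Image.open(path).convert("RGB")).astype(int)
    not_background = np.any(np.abs(img - background_rgb) > tol, axis=-1)
    matches_target = np.all(np.abs(img - target_rgb) <= tol, axis=-1)
    object_pixels = not_background.sum()
    if object_pixels == 0:
        return 0.0
    return float((matches_target & not_background).sum() / object_pixels)

# Hypothetical usage: coverage = color_coverage("apple.png", target_rgb=(200, 30, 30))
```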