Seeing the Evidence, Missing the Answer: Tool-Guided Vision-Language Models on Visual Illusions
arXiv cs.CV / 4/1/2026
Key Points
- Vision-language models are shown to have a consistent bias toward treating classic optical illusions as “real,” even after counterfactual image modifications.
- The paper proposes a tool-guided inference framework for the DataCV 2026 Challenge that mitigates this failure mode without retraining, by letting an off-the-shelf VLM call generic image-manipulation tools.
- An illusion-type routing prompt determines which tools to call for different perceptual question categories, and each tool call generates an immutable image resource stored in a persistent registry for the model to reuse.
- The approach demonstrates strong cross-structural generalization, maintaining performance on test sets with structurally unfamiliar illusion variants (e.g., rotated Mach Bands).
- The authors identify open questions, including a likely data-driven positive-detection bias, a gap between pixel-level spatial reasoning and higher-level logical inference over generated annotations, and heightened sensitivity to compression artifacts.
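
The routing-and-registry pattern described in the bullets above can be sketched minimally in Python. All names here (`ImageResource`, `ResourceRegistry`, the tool names, the category labels) are illustrative assumptions, not identifiers from the paper: the idea is only that a classification of the illusion type selects which generic tools to call, and each call's output is recorded as an immutable, reusable resource.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImageResource:
    """Immutable record of one tool call's output (id + provenance)."""
    resource_id: str
    tool: str
    args: tuple  # tool arguments, frozen as a tuple

class ResourceRegistry:
    """Persistent store of tool outputs the model can re-reference later."""
    def __init__(self):
        self._store = {}
        self._counter = 0

    def add(self, tool, args):
        self._counter += 1
        res = ImageResource(f"res-{self._counter}", tool, tuple(args))
        self._store[res.resource_id] = res
        return res

    def get(self, resource_id):
        return self._store[resource_id]

# Hypothetical routing table: an illusion category (as classified by
# the routing prompt) maps to the generic tools worth calling for it.
ROUTES = {
    "brightness": ["crop_patch", "sample_pixel"],
    "geometry": ["overlay_grid", "measure_segment"],
}

def route_and_call(category, registry):
    """Call every tool routed for this category; return resource ids."""
    return [registry.add(tool, (category,)).resource_id
            for tool in ROUTES.get(category, [])]
```

A brightness-type question would thus trigger patch-cropping and pixel-sampling calls, each leaving behind a registry entry the model can cite in later reasoning steps; the frozen dataclass enforces the "immutable image resource" property by raising on any attempted mutation.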