UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

arXiv cs.CV / 5/6/2026

Key Points

  • The paper introduces UnAC, a multimodal prompting method aimed at improving the performance of large multimodal models (LMMs) on complex, multi-step reasoning over visual evidence.
  • UnAC uses adaptive visual prompting to help models focus on salient image regions and an image-abstraction prompt to extract key information more effectively.
  • It further adds a stepwise self-checking mechanism that verifies each decomposed subquestion and its proposed answer to reduce reasoning errors (a minimal sketch of the full pipeline follows this list).
  • The approach is evaluated on three public benchmarks—MathVista, MM-Vet, and MMMU—using models such as GPT-4o, Gemini 1.5, and GPT-4V.
  • Overall, the work targets a common limitation of current LMMs: strong visual perception paired with unreliable multi-step reasoning for evidence-based tasks.
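
To make the three stages concrete, here is a minimal, hypothetical sketch of how such a pipeline could be wired to a vision-capable chat API. It assumes the OpenAI Python SDK; the prompt texts, the `ask` and `unac` helpers, and the example URL are illustrative paraphrases of the paper's ideas, not the authors' actual templates.

```python
# A minimal, hypothetical pipeline: adaptive visual prompting, image
# abstraction, then stepwise self-checking, each realized as one model call.
# Assumes the OpenAI Python SDK; prompt texts are illustrative paraphrases.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

def ask(prompt: str, image_url: str) -> str:
    """Send one text-plus-image turn and return the model's text reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

def unac(question: str, image_url: str) -> str:
    # Stage 1 -- adaptive visual prompting: steer attention to salient regions.
    regions = ask(
        f"Question: {question}\n"
        "List the image regions (including any text, numbers, or axis labels "
        "in them) that are most relevant to answering this question.",
        image_url,
    )
    # Stage 2 -- image abstraction: distill the regions into key facts.
    facts = ask(
        f"Salient regions:\n{regions}\n"
        "Summarize the key information in these regions as a short list of "
        "facts needed to answer the question.",
        image_url,
    )
    # Stage 3 -- stepwise self-checking: decompose, answer, verify, conclude.
    return ask(
        f"Facts:\n{facts}\nQuestion: {question}\n"
        "Decompose the question into subquestions. Answer each subquestion, "
        "check the answer against the facts, revise it if the check fails, "
        "and then state the final answer.",
        image_url,
    )

print(unac("What is the median value in the bar chart?",
           "https://example.com/chart.png"))
```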

Abstract

Although recent LMMs have become much stronger at visual perception, they remain unreliable on problems that require multi-step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens reasoning for complex multimodal tasks in LMMs (e.g., GPT-4o, Gemini 1.5, and GPT-4V). To improve image understanding and capture fine details, we propose an adaptive visual prompting strategy that enables LMMs to focus on salient regions. We further design an image-abstraction prompt to effectively extract key information from images. In addition, we introduce a stepwise self-checking scheme that improves reasoning by verifying each decomposed subquestion and its answer. We evaluate UnAC through extensive experiments on three public benchmarks: MathVista, MM-Vet, and MMMU.
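
The self-checking step can also be made explicit in code rather than folded into a single prompt. The sketch below is a minimal illustration under stated assumptions: `ask` stands for any function that sends a prompt (with the image already attached) to an LMM and returns its text reply, and the verification prompt is a paraphrase of the idea, not the paper's template.

```python
# Hypothetical stepwise self-checking: answer each decomposed subquestion,
# verify it with a separate prompt, and repair it once if the check fails.
from typing import Callable

def checked_answer(subquestion: str, facts: str,
                   ask: Callable[[str], str]) -> str:
    """Answer one subquestion and accept it only after a self-check."""
    answer = ask(f"Facts:\n{facts}\nSubquestion: {subquestion}\nAnswer briefly.")
    verdict = ask(
        f"Facts:\n{facts}\nSubquestion: {subquestion}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer consistent with the facts? "
        "Reply 'yes' or 'no', then give a one-line reason."
    )
    if verdict.strip().lower().startswith("no"):
        # One repair attempt, feeding the verifier's objection back in.
        answer = ask(
            f"Facts:\n{facts}\nSubquestion: {subquestion}\n"
            f"The answer '{answer}' was rejected because: {verdict}\n"
            "Give a corrected answer."
        )
    return answer

def answer_all(subquestions: list[str], facts: str,
               ask: Callable[[str], str]) -> list[str]:
    answers = []
    for q in subquestions:
        a = checked_answer(q, facts, ask)
        answers.append(a)
        # Fold the verified answer back into the fact list so later
        # subquestions can build on it.
        facts = f"{facts}\n{q} -> {a}"
    return answers
```

Checking each subquestion individually, instead of only the final answer, is what lets a scheme like this localize and correct intermediate reasoning errors before they propagate.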