A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

arXiv cs.AI / 4/22/2026


Key Points

  • A-MAR is an agent-based multimodal art retrieval framework that improves artwork understanding by explicitly using structured reasoning plans rather than relying on implicit internal knowledge.
  • Given an artwork and a query, A-MAR decomposes the task into step-by-step goals and evidence requirements, then conditions retrieval on that plan to enable more targeted, evidence-grounded explanations.
  • The paper introduces ArtCoT-QA, a diagnostic benchmark designed to evaluate multi-step reasoning chains for art-related questions beyond single final-answer accuracy.
  • Experiments on datasets including SemArt and Artpedia show A-MAR outperforms static, non-planned retrieval and strong MLLM baselines in the quality of explanations, with further gains in evidence grounding and multi-step reasoning on ArtCoT-QA.
  • The authors provide code and data via GitHub, positioning A-MAR as a move toward more interpretable, goal-driven AI systems for knowledge-intensive cultural applications.
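The pipeline the key points describe (decompose the query into a reasoning plan, then condition retrieval on each step's evidence requirement) can be illustrated with a toy sketch. This is not the authors' implementation (see their GitHub repo for that); the fixed plan, corpus lookup, and all function names below are hypothetical stand-ins for the MLLM planner and retriever used in the paper.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    goal: str            # what this reasoning step must establish
    evidence_need: str   # the kind of evidence retrieval should target

def make_plan(query: str) -> list[PlanStep]:
    """Decompose a query into step-wise goals and evidence requirements.
    A fixed plan stands in here for the MLLM planner used in the paper."""
    return [
        PlanStep("identify the depicted scene", "visual description"),
        PlanStep("situate the work historically", "historical context"),
        PlanStep("explain the stylistic choices", "style commentary"),
    ]

def retrieve(step: PlanStep, corpus: dict[str, str]) -> str:
    """Plan-conditioned retrieval: select evidence matching this step's
    stated need, rather than retrieving once for the whole query (toy lookup)."""
    return corpus.get(step.evidence_need, "")

def explain(query: str, corpus: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each plan step's goal with the evidence retrieved for it,
    yielding a step-wise, evidence-grounded explanation."""
    return [(s.goal, retrieve(s, corpus)) for s in make_plan(query)]

# Tiny mock evidence corpus, keyed by evidence type.
corpus = {
    "visual description": "A seated woman with an enigmatic smile.",
    "historical context": "Painted in Florence in the early 16th century.",
    "style commentary": "Sfumato blurs contours for lifelike softness.",
}

for goal, evidence in explain("Why is this portrait famous?", corpus):
    print(f"{goal}: {evidence}")
```

The contrast with static retrieval is that evidence selection here happens per reasoning step, so each part of the explanation can cite the evidence gathered for its specific goal.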

Abstract

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.