Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

arXiv cs.CV / 5/6/2026

Key Points

  • The paper improves open-domain Visual Question Answering (VQA) by integrating multimodal LLMs (MLLMs) more effectively with retrieval-augmented generation (RAG).
  • It introduces a logical prompting strategy called CoVQD that combines Chain-of-Thought reasoning with Visual Question Decomposition to better steer retrieval toward relevant knowledge (a speculative prompt sketch appears after this list).
  • Building on CoVQD, the authors present a new framework, CoVQD-guided RAG (CgRAG), designed to provide more coherent and comprehensive external knowledge during multimodal inference.
  • Experiments on the E-VQA, InfoSeek, and OKVQA benchmarks show that the approach improves performance, generalization, and reliability in complex cross-domain VQA settings.
  • Overall, the work advances retrieval-based VQA by coupling structured visual-text reasoning with knowledge acquisition to make multimodal LLM answers more dependable.
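
The abstract does not publish the CoVQD template itself, so the Python sketch below is only a plausible rendering of the stated idea: fuse Chain-of-Thought cues with a visual question decomposition in a single prompt. The `build_covqd_prompt` helper, the step wording, and the example decomposition are illustrative assumptions, not the paper's implementation.

```python
def build_covqd_prompt(question: str, sub_questions: list[str]) -> str:
    """Fuse Chain-of-Thought cues with a Visual Question Decomposition.

    Hypothetical CoVQD-style template: the paper's exact wording is not
    given in the abstract, so this layout is an assumption.
    """
    steps = "\n".join(
        f"Step {i}: {sq}" for i, sq in enumerate(sub_questions, start=1)
    )
    return (
        "You are answering a knowledge-intensive question about an image.\n"
        f"Main question: {question}\n"
        "Answer each sub-question in order, reasoning step by step,\n"
        "then combine the intermediate answers into a final answer.\n"
        f"{steps}"
    )


# Hypothetical decomposition of an OKVQA-style question.
print(build_covqd_prompt(
    "What year was the landmark in this photo completed?",
    [
        "Which landmark is shown in the image?",
        "When was that landmark completed?",
    ],
))
```

Phrasing the decomposition as ordered steps is what lets each sub-question double as a retrieval query in the framework sketched after the abstract.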

Abstract

With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question Answering (VQA) has increasingly employed MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by more effectively integrating MLLMs with structured reasoning and knowledge acquisition. We introduce a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), termed CoVQD, to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which enables MLLMs to access more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios. Extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness of the proposed method.
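
Read together, the abstract suggests a pipeline of decompose, retrieve per sub-question, pool evidence, and answer. The sketch below is a hypothetical reconstruction under that reading; `retrieve`, `answer_with_mllm`, the `Passage` type, and the top-k pooling heuristic are stand-ins, since the paper's retriever, corpus, and evidence-fusion details are not given here.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    score: float


def retrieve(query: str, k: int = 3) -> list[Passage]:
    """Stand-in retriever; a real system would search a knowledge corpus."""
    return [Passage(text=f"[evidence for: {query}]", score=1.0)][:k]


def answer_with_mllm(image_path: str, prompt: str, evidence: list[Passage]) -> str:
    """Stand-in for an MLLM call conditioned on image, prompt, and evidence."""
    context = "\n".join(p.text for p in evidence)
    return f"<answer for {image_path} given:\n{prompt}\n{context}>"


def cgrag_answer(image_path: str, question: str, sub_questions: list[str]) -> str:
    # 1. CoVQD-guided retrieval: each sub-question is its own query, so the
    #    pooled evidence covers every reasoning step, not just the raw question.
    evidence: list[Passage] = []
    for sq in sub_questions:
        evidence.extend(retrieve(sq))
    # 2. Deduplicate and keep the top-scoring passages (a simple heuristic;
    #    the paper's actual fusion strategy is not described in the abstract).
    pooled = sorted({p.text: p for p in evidence}.values(),
                    key=lambda p: p.score, reverse=True)[:5]
    # 3. MLLM inference over the image, the decomposed prompt, and evidence.
    prompt = question + "\nSub-questions:\n" + "\n".join(sub_questions)
    return answer_with_mllm(image_path, prompt, pooled)


print(cgrag_answer(
    "landmark.jpg",
    "What year was the landmark in this photo completed?",
    ["Which landmark is shown?", "When was it completed?"],
))
```

Retrieving once per sub-question rather than once on the raw question is what the abstract credits for more coherent and comprehensive external knowledge reaching the MLLM at inference time.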