MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

arXiv cs.AI / 4/8/2026


Key Points

  • The paper introduces MG$^2$-RAG, a lightweight multi-granularity graph RAG framework aimed at improving cross-modal reasoning in multimodal LLMs.
  • It builds a hierarchical multimodal knowledge graph by fusing textual entities with visual regions into unified multimodal nodes that preserve atomic evidence, avoiding costly translation-to-text pipelines.
  • MG$^2$-RAG uses a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to enable structured multi-hop reasoning.
  • Experiments on four multimodal tasks (retrieval, KB-VQA, reasoning, and classification) report state-of-the-art performance with substantial efficiency gains: an average 43.3× speedup and 23.9× cost reduction versus advanced graph-based methods.
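The unified multimodal nodes from the second key point can be sketched as follows. This is an illustrative reconstruction, not the paper's actual schema: the node fields, the bounding-box representation, and the convex-combination fusion operator are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class MultimodalNode:
    """Hypothetical node fusing a textual entity with its grounded visual region.

    The idea (per the paper's description) is that keeping both the text mention
    and the image-region reference on one node preserves atomic evidence,
    instead of translating the image into text and discarding visual detail.
    """
    entity: str                                     # textual entity parsed from the document
    text_embedding: List[float]                     # dense embedding of the entity's text context
    region: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1) box from visual grounding
    region_embedding: Optional[List[float]] = None      # embedding of the grounded image region

def fuse(text_emb: List[float],
         region_emb: Optional[List[float]],
         alpha: float = 0.5) -> List[float]:
    """Toy fusion operator: convex combination of the two modality embeddings.
    MG$^2$-RAG's real fusion may differ; this only shows where fusion happens."""
    if region_emb is None:          # entity was never grounded to a region
        return list(text_emb)
    return [alpha * t + (1 - alpha) * v for t, v in zip(text_emb, region_emb)]

node = MultimodalNode("Eiffel Tower", [0.2, 0.8], (10, 5, 120, 200), [0.4, 0.6])
fused = fuse(node.text_embedding, node.region_embedding)
```

A node without a grounded region simply falls back to its text embedding, so purely textual entities and fused multimodal entities live in the same graph.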

Abstract

Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG$^2$-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG$^2$-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (i.e., retrieval, knowledge-based VQA, reasoning, and classification) demonstrate that MG$^2$-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead, with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.
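The retrieval mechanism described in the abstract, where dense similarities seed the graph and relevance is then propagated to support multi-hop reasoning, can be sketched as a personalized-PageRank-style iteration. This is a generic stand-in, not the paper's algorithm: the adjacency format, damping factor, and uniform edge weights are all assumptions for illustration.

```python
from typing import Dict, List

def propagate_relevance(adj: Dict[str, List[str]],
                        seed_scores: Dict[str, float],
                        damping: float = 0.85,
                        iters: int = 20) -> Dict[str, float]:
    """Spread query relevance over a knowledge graph.

    adj:         node -> list of neighbor nodes (toy unweighted graph).
    seed_scores: dense-similarity scores for nodes directly matched by the
                 query; propagation lets multi-hop neighbors accumulate mass.
    Illustrative only; MG$^2$-RAG's multi-granularity aggregation may use
    different weights and granularity levels.
    """
    nodes = list(adj)
    total = sum(seed_scores.values()) or 1.0
    seed = {n: seed_scores.get(n, 0.0) / total for n in nodes}  # normalized restart vector
    scores = dict(seed)
    for _ in range(iters):
        new = {}
        for n in nodes:
            # relevance flowing in from neighbors, split evenly over their edges
            incoming = sum(scores[m] / len(adj[m]) for m in nodes if n in adj[m])
            new[n] = (1 - damping) * seed[n] + damping * incoming
        scores = new
    return scores

# Query matches only node A, but B and C gain relevance via the graph (multi-hop).
adj = {"A": ["B"], "B": ["C"], "C": ["A"]}
scores = propagate_relevance(adj, {"A": 1.0})
```

The point of the sketch is the qualitative behavior: a node two hops from the query's dense match still ends up with a nonzero retrieval score, which is what enables structured multi-hop reasoning over flat top-k retrieval.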