MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

arXiv cs.AI / 4/8/2026


Key Points

The paper introduces MG$^2$-RAG, a lightweight multi-granularity graph RAG framework aimed at improving cross-modal reasoning in multimodal LLMs.
It builds a hierarchical multimodal knowledge graph by fusing textual entities with visual regions into unified multimodal nodes that preserve atomic evidence, avoiding costly translation-to-text pipelines.
MG$^2$-RAG uses a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to enable structured multi-hop reasoning.
Experiments on four multimodal tasks (retrieval, KB-VQA, reasoning, and classification) report state-of-the-art performance with substantial efficiency gains: an average 43.3× speedup and 23.9× cost reduction versus advanced graph-based methods.
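The third point, aggregating dense similarities and then propagating relevance across the graph, can be illustrated with a small sketch. This is not the paper's algorithm, which the summary does not specify. It assumes cosine similarity as the dense score and a simple neighbor-averaging update. The names `cosine`, `propagate_relevance`, `alpha`, and `steps` are all hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propagate_relevance(seed, edges, alpha=0.5, steps=2):
    """Blend each node's own dense score with the mean score of its neighbors.

    seed  : dict mapping node -> initial (dense similarity) score
    edges : dict mapping node -> list of neighbor nodes (all present in seed)
    alpha : assumed mixing weight for neighbor evidence (hypothetical parameter)
    steps : number of propagation rounds, i.e. how many hops relevance can travel
    """
    scores = dict(seed)
    for _ in range(steps):
        updated = {}
        for node in scores:
            neighbors = edges.get(node, [])
            spread = sum(scores[n] for n in neighbors) / len(neighbors) if neighbors else 0.0
            updated[node] = (1 - alpha) * seed[node] + alpha * spread
        scores = updated
    return scores


# Toy graph: "a" matches the query directly, "b" is linked to "a", "c" is isolated.
query = [1.0, 0.0]
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 1.0]}
edges = {"a": ["b"], "b": ["a"], "c": []}

seed = {node: cosine(query, vec) for node, vec in embeddings.items()}
scores = propagate_relevance(seed, edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, `b` and `c` have identical embeddings and identical dense scores, yet `b` ends up ranked above `c` because it is connected to the directly relevant node `a`. That is the general effect a flat vector search cannot produce on its own.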

Abstract

Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose MG$^2$-RAG, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG$^2$-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (i.e., retrieval, knowledge-based VQA, reasoning, and classification) demonstrate that MG$^2$-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead with an average 43.3× speedup and 23.9× cost reduction compared with advanced graph-based frameworks.
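As a rough illustration of what a "unified multimodal node" might look like, the sketch below attaches image regions to textual entities while keeping the original sentences and bounding boxes as evidence. The abstract does not describe the actual data model or grounding procedure, so everything here is an assumption: the class names `VisualRegion` and `MultimodalNode`, the `fuse` helper, and the case-insensitive label matching used as a stand-in for visual grounding.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class VisualRegion:
    """A cropped area of an image, kept as raw evidence rather than a caption."""
    image_id: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    label: str                      # label assumed to come from some grounding model


@dataclass
class MultimodalNode:
    """One graph node holding both textual and visual evidence for an entity."""
    name: str
    text_mentions: List[str] = field(default_factory=list)
    regions: List[VisualRegion] = field(default_factory=list)


def fuse(entities: Dict[str, List[str]], regions: List[VisualRegion]) -> Dict[str, MultimodalNode]:
    """Attach each region to the entity whose name matches its label.

    Matching by lowercase label is a placeholder for real visual grounding.
    Regions with no matching entity are dropped in this sketch.
    """
    nodes = {
        name.lower(): MultimodalNode(name=name, text_mentions=list(mentions))
        for name, mentions in entities.items()
    }
    for region in regions:
        key = region.label.lower()
        if key in nodes:
            nodes[key].regions.append(region)
    return nodes


# Hypothetical inputs: one parsed entity and two detected regions.
entities = {"Red Bicycle": ["A red bicycle leans against the wall."]}
regions = [
    VisualRegion("img1", (10, 20, 110, 220), "red bicycle"),
    VisualRegion("img1", (0, 0, 5, 5), "tree"),
]
nodes = fuse(entities, regions)
```

The point of the design is that the node stores the region itself (image id and box) alongside the sentence, so downstream retrieval can use the visual evidence directly instead of a text description of it.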


