MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG

arXiv cs.CL · April 28, 2026

📰 News · Models & Research

Key Points

  • MEG-RAG targets shortcomings of Multimodal Retrieval-Augmented Generation (MRAG) by improving how systems judge whether retrieved multimodal evidence truly supports the semantic core of an answer.
  • The article introduces Multi-modal Evidence Grounding (MEG), a semantic-aware metric that estimates evidence contribution using “Semantic Certainty Anchoring” based on high-IDF, information-rich tokens.
  • Building on MEG, MEG-RAG trains a multimodal reranker to align retrieved evidence with semantic anchors from ground truth, prioritizing high-value content over simple token-probability heuristics.
  • Experiments on the M²RAG benchmark indicate that MEG-RAG outperforms strong baselines and generalizes robustly across different teacher models.
  • Overall, the work provides both a new evaluation/quantification metric (MEG) and an associated training framework (MEG-RAG) to reduce hallucinations and boost multimodal consistency in generated outputs.
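The "Semantic Certainty Anchoring" idea in the key points can be illustrated with a small sketch. This is not the paper's implementation; the corpus, function names, and the choice of top-k anchor selection are all hypothetical, meant only to show how high-IDF tokens of an answer could serve as its semantic anchors while common filler words are ignored.

```python
import math
from collections import Counter

def idf_scores(corpus):
    """Inverse document frequency over a list of tokenized documents."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log(n / df[tok]) for tok in df}

def semantic_anchors(answer_tokens, idf, k=3):
    """Hypothetical anchor selection: keep the k highest-IDF answer tokens."""
    return sorted(answer_tokens, key=lambda t: idf.get(t, 0.0), reverse=True)[:k]

# Toy corpus and answer, purely for illustration.
corpus = [
    "the eiffel tower is in paris".split(),
    "the louvre museum is in paris".split(),
    "the cat sat on the mat".split(),
]
idf = idf_scores(corpus)
answer = "the eiffel tower is in paris".split()
anchors = semantic_anchors(answer, idf)
print(anchors)  # rare, entity-bearing tokens like "eiffel" and "tower" rank first
```

In this toy setup the zero-IDF token "the" can never become an anchor, which mirrors the article's claim that information-rich tokens, not position-based confidence, should carry the evidence-grounding signal.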

Abstract

Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M²RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.
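The abstract's core move, ranking retrieved evidence by how well it covers the answer's semantic anchors rather than by raw token probabilities, can be sketched with a toy grounding score. This is a simplified stand-in, not MEG itself: the `grounding_score` function, the anchor list, and the candidate documents below are all illustrative assumptions.

```python
def grounding_score(evidence_tokens, anchors):
    """Toy MEG-style score: fraction of the answer's semantic anchors
    that appear in the retrieved evidence. The paper's actual metric
    is more involved; this only illustrates the ranking principle."""
    if not anchors:
        return 0.0
    ev = set(evidence_tokens)
    return sum(1 for a in anchors if a in ev) / len(anchors)

# Hypothetical anchors for the answer "the eiffel tower is in paris".
anchors = ["eiffel", "tower", "paris"]

# Three candidate evidence passages a retriever might return.
candidates = {
    "doc_a": "the eiffel tower stands in paris".split(),
    "doc_b": "paris is a large european city".split(),
    "doc_c": "the weather today is sunny".split(),
}

# Rerank candidates by anchor coverage, best first.
ranked = sorted(candidates,
                key=lambda d: grounding_score(candidates[d], anchors),
                reverse=True)
print(ranked)  # doc_a covers all three anchors, doc_b one, doc_c none
```

A reranker trained against such anchor-coverage targets would, in the paper's framing, learn to push superficially relevant but ungrounded passages (like `doc_c`) below evidence that actually supports the answer's semantic core.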