RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection

arXiv cs.CV / 4/9/2026


Key Points

  • The paper proposes RASR, a retrieval-augmented framework for multimodal fake news video detection that aims to improve reasoning beyond conventional feature fusion or consistency checks.
  • RASR uses a Cross-instance Semantic Parser and Retriever (CSPR) to decompose videos into semantic primitives and pull related historical evidence from a dynamic memory bank.
  • A Domain-Guided Multimodal Reasoning (DGMP) module injects domain priors to steer an expert multimodal large language model toward generating domain-aware analysis reports.
  • The Multi-View Feature Decoupling and Fusion (MVDFF) module combines multi-dimensional features via adaptive gating to strengthen authenticity decisions.
  • Experiments on FakeSV and FakeTT show RASR achieves state-of-the-art performance, better cross-domain generalization, and up to a 0.93% accuracy improvement over baselines.
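The paper does not provide implementation details for the CSPR retrieval step, but the idea of pulling related historical evidence from a dynamic memory bank can be illustrated with a minimal sketch. Everything below (the `MemoryBank` class, cosine-similarity ranking, and the embedding format) is a hypothetical simplification, not the authors' actual implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryBank:
    """Toy stand-in for CSPR's dynamic memory bank: stores
    (embedding, evidence) pairs from previously seen videos and
    retrieves the top-k most similar entries for a new query."""

    def __init__(self):
        self.entries = []  # list of (embedding, evidence) tuples

    def add(self, embedding, evidence):
        # new instances are appended, so the bank grows "dynamically"
        self.entries.append((embedding, evidence))

    def retrieve(self, query, k=3):
        # rank stored evidence by similarity to the query's
        # semantic-primitive embedding, highest first
        scored = sorted(self.entries,
                        key=lambda e: cosine(query, e[0]),
                        reverse=True)
        return [evidence for _, evidence in scored[:k]]

# Usage: two past videos in the bank, one query close to the first
bank = MemoryBank()
bank.add([1.0, 0.0], "debunked claim A")
bank.add([0.0, 1.0], "verified report B")
top = bank.retrieve([1.0, 0.1], k=1)
```

In the real framework the query would be built from high-level semantic primitives parsed out of the video, not raw vectors, and retrieval would likely use an approximate-nearest-neighbor index rather than a linear scan.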

Abstract

Multimodal fake news video detection is a crucial research direction for maintaining the credibility of online information. Existing studies primarily verify content authenticity by constructing multimodal feature fusion representations or by using pre-trained language models to analyze video-text consistency. However, these methods still face the following limitations: (1) they lack cross-instance global semantic correlations, making it difficult to leverage historical associative evidence when verifying the current video; (2) semantic discrepancies across domains hinder the transfer of general knowledge, and the models lack the guidance of domain-specific expert knowledge. To this end, we propose a novel Retrieval-Augmented Semantic Reasoning (RASR) framework. First, a Cross-instance Semantic Parser and Retriever (CSPR) deconstructs the video into high-level semantic primitives and retrieves relevant associative evidence from a dynamic memory bank. Subsequently, a Domain-Guided Multimodal Reasoning (DGMP) module incorporates domain priors to drive an expert multimodal large language model in generating domain-aware, in-depth analysis reports. Finally, a Multi-View Feature Decoupling and Fusion (MVDFF) module integrates multi-dimensional features through an adaptive gating mechanism to achieve robust authenticity determination. Extensive experiments on the FakeSV and FakeTT datasets demonstrate that RASR significantly outperforms state-of-the-art baselines, achieves superior cross-domain generalization, and improves overall detection accuracy by up to 0.93%.
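The abstract describes the final MVDFF step only at a high level: multi-dimensional features are combined "through an adaptive gating mechanism." A common way to realize such gating is to let each feature view compute its own scalar gate and weight its contribution to the fused vector accordingly. The sketch below is a hypothetical, dependency-free illustration of that pattern; the weights, biases, and per-view scalar gates are assumptions, not details taken from the paper:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(views, gate_weights, gate_biases):
    """Adaptive gating over multiple feature views (toy version).

    views:        list of feature vectors, all of length d
                  (e.g. visual, textual, and retrieved-evidence views)
    gate_weights: one weight vector of length d per view
    gate_biases:  one scalar bias per view

    Each view computes a scalar gate g = sigmoid(w . v + b) from its
    own content, then contributes g * v to the fused representation,
    so uninformative views can be softly suppressed.
    """
    d = len(views[0])
    fused = [0.0] * d
    for v, w, b in zip(views, gate_weights, gate_biases):
        g = sigmoid(sum(wi * vi for wi, vi in zip(w, v)) + b)
        for i in range(d):
            fused[i] += g * v[i]
    return fused

# Usage: two 2-d views; with zero weights/biases every gate is
# sigmoid(0) = 0.5, so each view contributes at half strength.
views = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
fused = gated_fusion(views, zeros, [0.0, 0.0])  # -> [0.5, 0.5]
```

In a trained model the gate parameters would be learned end-to-end, and the gates are often vector-valued (one value per feature dimension) rather than the single scalar used here for brevity.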