Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

arXiv cs.AI / 5/5/2026

Key Points

  • Multi-Hop Fact Verification (MHFV) requires stitching together evidence across multiple steps, and LLMs often fail due to hallucinations and broken reasoning chains.
  • The paper proposes grounding verification in a Structural Causal Model (SCM), framing claim verification as a constructive causal inference process rather than relying only on Chain-of-Thought.
  • Experiments reveal an “inverted U-shaped” relationship between reasoning chain length/structural complexity and accuracy, where too much complexity reduces performance.
  • To manage this trade-off, the authors introduce a rule-based reinforcement learning approach using Group Relative Policy Optimization (GRPO) to balance structural depth and conciseness.
  • Results on HoVer and EX-FEVER show the proposed SCM-GRPO framework substantially outperforms existing baselines while remaining more interpretable and reliable.

Abstract

Multi-Hop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs), which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a rule-based reinforcement learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.
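
To make the GRPO idea concrete, the sketch below shows the group-relative advantage step that gives the method its name: several sampled reasoning chains for the same claim are scored, and each score is normalized against the group's mean and standard deviation. The abstract does not specify the reward rules, so `rule_reward`, its `target_len` and `penalty` parameters, and the sample data are hypothetical stand-ins chosen only to illustrate a depth-versus-conciseness trade-off. The use of population standard deviation is also an assumption, since implementations vary.

```python
from statistics import mean, pstdev


def rule_reward(correct: bool, chain_len: int,
                target_len: int = 3, penalty: float = 0.1) -> float:
    """Hypothetical rule-based reward (NOT taken from the paper).

    Gives 1.0 for a correct verdict and subtracts a fixed penalty for
    every reasoning step the chain deviates from an assumed target length.
    """
    base = 1.0 if correct else 0.0
    return base - penalty * abs(chain_len - target_len)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group.

    This is the standard GRPO-style advantage: (r_i - mean) / (std + eps).
    Population standard deviation is an assumption made for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of four sampled chains for one claim:
# (verdict correct?, number of reasoning steps). Made-up data.
group = [(True, 3), (True, 7), (False, 2), (False, 6)]

rewards = [rule_reward(correct, steps) for correct, steps in group]
advantages = group_relative_advantages(rewards)

for (correct, steps), adv in zip(group, advantages):
    print(f"correct={correct!s:5} steps={steps} advantage={adv:+.3f}")
```

Under these assumed rules, the correct three-step chain receives the highest advantage, while the correct but overlong seven-step chain is rewarded less, which is one simple way a rule-based signal could discourage excessive structural complexity.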