
DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

arXiv cs.CL / 3/20/2026

📰 News · Models & Research

Key Points

  • DaPT introduces a dual-path retrieval-augmented framework for multilingual multi-hop question answering (MM-hop QA).
  • The authors create multilingual MM-hop benchmarks by translating English benchmarks into five languages to enable evaluation across languages.
  • DaPT generates sub-question graphs in parallel for the source-language query and its English translation, then merges them before applying a bilingual retrieval-and-answer strategy.
  • Experimental results show that advanced RAG systems suffer from performance imbalance in multilingual scenarios, with DaPT delivering more accurate and concise answers than baselines (e.g., 18.3% relative improvement in average EM on MuSiQue).
  • The work highlights the importance of multilingual evaluation and could influence future multilingual QA research and benchmark development.

Abstract

Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in English settings. However, RAG systems inevitably face scenarios that require retrieving across multilingual corpora and handling multilingual queries, leaving several open challenges. The first is the absence of benchmarks that assess RAG systems' capabilities under the multilingual multi-hop (MM-hop) QA setting. The second is the overreliance on LLMs' strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.