StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems
arXiv cs.AI / April 28, 2026
Key Points
- StratRAG is an open-source evaluation dataset designed to benchmark Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning under realistic, noisy document-pool conditions.
- The dataset contains 2,200 examples derived from HotpotQA (distractor setting), covering three question types (bridge, comparison, yes-no) with pools of 15 candidate documents that include exactly 2 gold documents plus 13 topical distractors.
- The authors evaluate three retrieval strategies—BM25, dense retrieval using all-MiniLM-L6-v2, and hybrid fusion—using metrics such as Recall@k, MRR, and NDCG@5.
- Hybrid retrieval delivers the best overall results (Recall@2 = 0.70, MRR = 0.93), but bridge questions remain more challenging (Recall@2 = 0.67), suggesting a need for improved retrieval policies.
- StratRAG is publicly available on Hugging Face for the research community to use and reproduce results.
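The evaluation metrics named above — Recall@k, MRR, and NDCG@5 — have standard definitions that can be computed directly from a ranked list of document IDs and the set of gold documents. The sketch below is illustrative, not taken from the StratRAG code; it assumes binary relevance (gold = 1, distractor = 0), which matches the dataset's 2-gold / 13-distractor pool design.

```python
import math

def recall_at_k(retrieved, gold, k):
    """Fraction of gold documents found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)

def mrr(retrieved, gold):
    """Reciprocal rank of the first gold document (0 if none appears)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, gold, k):
    """NDCG@k with binary relevance: DCG of the top-k list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For example, with gold documents `{"g1", "g2"}` and a retrieved ranking `["d3", "g1", "d7", "g2", "d1"]`, Recall@2 is 0.5 (one of two gold documents in the top 2) and MRR is 0.5 (first gold hit at rank 2).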
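The summary does not specify how the hybrid strategy fuses BM25 and dense rankings; a common, score-free choice for this kind of fusion is reciprocal rank fusion (RRF), sketched below as one plausible baseline. The constant `k = 60` is the conventional RRF smoothing value, not a figure from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked doc-ID lists (e.g. one BM25, one dense) with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document placed second by both retrievers will typically outrank one placed first by only a single retriever, which is the behavior that makes fusion robust when the two retrievers disagree.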
Related Articles
How to Build Traceable and Evaluated LLM Workflows Using Promptflow, Prompty, and OpenAI
MarkTechPost

An improvement of the convergence proof of the ADAM-Optimizer
Dev.to
Where Is the Claude Code Session History? How to Recover Your AI Coding Conversation Logs
Dev.to
We built an AI that runs an entire business autonomously. Not a demo. Not a prototype. Actually running. YC-backed, here's what we learned.
Reddit r/artificial
langchain-tests==1.1.7
LangChain Releases