StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems
arXiv cs.AI / April 28, 2026
Key Points
- StratRAG is an open-source evaluation dataset designed to benchmark Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning under realistic, noisy document-pool conditions.
- The dataset contains 2,200 examples derived from HotpotQA (distractor setting), covering three question types (bridge, comparison, yes-no) with pools of 15 candidate documents that include exactly 2 gold documents plus 13 topical distractors.
- The authors evaluate three retrieval strategies—BM25, dense retrieval using all-MiniLM-L6-v2, and hybrid fusion—using metrics such as Recall@k, MRR, and NDCG@5.
- Hybrid retrieval delivers the best overall results (Recall@2 = 0.70, MRR = 0.93), but bridge questions remain more challenging (Recall@2 = 0.67), suggesting a need for improved retrieval policies.
- StratRAG is publicly available on Hugging Face for the research community to use and reproduce results.
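The evaluation metrics named above — Recall@k, MRR, and NDCG@5 — have standard definitions that can be computed directly from a ranked list of document IDs and the set of gold documents. The sketch below is illustrative, not taken from the StratRAG code; it assumes binary relevance (gold = 1, distractor = 0), which matches the dataset's 2-gold / 13-distractor pool design.

```python
import math

def recall_at_k(retrieved, gold, k):
    """Fraction of gold documents found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)

def mrr(retrieved, gold):
    """Reciprocal rank of the first gold document (0 if none appears)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, gold, k):
    """NDCG@k with binary relevance: DCG of the top-k list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in gold)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal else 0.0
```

For example, with gold documents `{"g1", "g2"}` and a retrieved ranking `["d3", "g1", "d7", "g2", "d1"]`, Recall@2 is 0.5 (one of two gold documents in the top 2) and MRR is 0.5 (first gold hit at rank 2).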
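The summary does not specify how the hybrid strategy fuses BM25 and dense rankings; a common, score-free choice for this kind of fusion is reciprocal rank fusion (RRF), sketched below as one plausible baseline. The constant `k = 60` is the conventional RRF smoothing value, not a figure from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked doc-ID lists (e.g. one BM25, one dense) with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document placed second by both retrievers will typically outrank one placed first by only a single retriever, which is the behavior that makes fusion robust when the two retrievers disagree.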
Related Articles
How to Build Traceable and Evaluated LLM Workflows Using Promptflow, Prompty, and OpenAI
MarkTechPost

An improvement of the convergence proof of the ADAM-Optimizer
Dev.to
Where Is the Claude Code Session History? How to Recover Your AI Coding Conversation Logs
Dev.to
We built an AI that runs an entire business autonomously. Not a demo. Not a prototype. Actually running. YC-backed, here's what we learned.
Reddit r/artificial
langchain-tests==1.1.7
LangChain Releases