H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations

arXiv cs.CL / 5/4/2026


Key Points

  • The paper introduces H-RAG, a hierarchical parent-child retrieval pipeline submitted to SemEval-2026 Task 8 (MTRAGEval), covering both retrieval quality (Task A) and multi-turn RAG generation with evidence grounding (Task C).
  • H-RAG splits documents into overlapping sentence-based “child” chunks for fine-grained retrieval, while retaining full documents as “parent” units to reconstruct coherent context during generation.
  • The retrieval stage uses a hybrid dense-sparse search with tunable weighting plus embedding-similarity rescoring over child chunks, then aggregates retrieved evidence at the parent level for the language model.
  • Reported results show nDCG@5 of 0.4271 on Task A and a harmonic mean of 0.3241 on Task C, highlighting that retrieval configuration and parent-level evidence aggregation are critical for multi-turn RAG performance.
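The parent-child idea in the points above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the regex sentence splitter, the `window` and `stride` values, the linear `alpha` blend of dense and sparse scores (assumed to be already normalized to a common range), and the max-over-children rule for ranking parents. The embedding-similarity rescoring step is omitted.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def make_child_chunks(doc_id, text, window=3, stride=2):
    # Slide a window of `window` sentences forward by `stride` sentences,
    # so consecutive chunks overlap whenever stride < window.
    sents = split_sentences(text)
    chunks = []
    start = 0
    while start < len(sents):
        chunks.append({"parent": doc_id,
                       "text": " ".join(sents[start:start + window])})
        if start + window >= len(sents):
            break  # the last sentence is covered; stop
        start += stride
    return chunks


def hybrid_score(dense, sparse, alpha=0.5):
    # Hypothetical tunable weighting: alpha=1 is dense only, alpha=0 sparse only.
    return alpha * dense + (1.0 - alpha) * sparse


def aggregate_parents(scored_children, top_k=5):
    # Rank each parent document by its best-scoring child chunk (assumed rule).
    best = {}
    for child, score in scored_children:
        pid = child["parent"]
        if pid not in best or score > best[pid]:
            best[pid] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    chunks = make_child_chunks("d1", "A one. B two. C three. D four.")
    for c in chunks:
        print(c)
    scored = [(c, hybrid_score(0.8, 0.2, alpha=0.75)) for c in chunks]
    print(aggregate_parents(scored))
```

In this sketch the child chunks carry only a parent identifier, so the generator can be handed the full parent document once any of its children ranks highly. Other aggregation rules (sum or mean of child scores, for example) would fit the same interface.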

Abstract

We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages). Task A evaluates standalone retrieval quality, while Task C assesses end-to-end retrieval-augmented generation (RAG) in multi-turn conversational settings, requiring both accurate answer generation and faithful grounding in retrieved evidence. Our approach implements a hierarchical parent-child RAG pipeline that separates fine-grained child-level retrieval from parent-level context reconstruction during generation. Documents are segmented into overlapping sentence-based child chunks, while full documents are preserved as parent units to provide coherent context. Retrieval combines hybrid dense-sparse search, tunable weighting, and embedding-based similarity rescoring over child chunks. Retrieved evidence is aggregated at the parent level and supplied to an instruction-tuned language model for response generation. H-RAG achieves an nDCG@5 score of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C (RB_agg: 0.2488, RL_F: 0.2703, RB_llm: 0.6508), underscoring the importance of retrieval configuration and parent-level aggregation in multi-turn RAG performance.
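As a quick arithmetic check on the reported Task C figure, the sketch below takes the unweighted harmonic mean of the three component scores quoted in the abstract. That the overall score is an unweighted harmonic mean of exactly these three values is an assumption here, although the numbers are consistent with it. The `ndcg_at_k` helper is likewise only one common formulation of nDCG (linear gain, log2 discount), and it builds the ideal ranking from the retrieved list alone, a simplification that the official task scorer may not share.

```python
import math
from statistics import harmonic_mean

# Component scores as quoted in the abstract.
task_c_components = {"RB_agg": 0.2488, "RL_F": 0.2703, "RB_llm": 0.6508}

# Unweighted harmonic mean (assumed aggregation).
task_c_score = harmonic_mean(list(task_c_components.values()))


def dcg_at_k(relevances, k):
    # Linear-gain DCG: rel_i / log2(i + 1) for 1-based rank i.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=5):
    # `relevances` lists the graded relevance of each retrieved item, in rank order.
    # Simplification: the ideal ranking is the same list sorted by relevance.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    print(round(task_c_score, 4))  # 0.3241
    print(ndcg_at_k([0, 1, 0, 1, 0], k=5))
```

Because a harmonic mean is pulled toward its smallest term, the comparatively high RB_llm value (0.6508) lifts the overall score only modestly above RB_agg and RL_F.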