From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark

arXiv cs.CL / 3/27/2026


Key Points

  • The paper argues that existing medical RAG systems miss key evidence-based medicine (EBM) requirements, specifically PICO alignment and evidence hierarchy during reranking.
  • It introduces SR-RAG, an EBM-adapted GraphRAG framework that incorporates the PICO framework into knowledge-graph construction and retrieval.
  • The study proposes Bayesian Evidence Tier Reranking (BETR) to calibrate ranking scores according to evidence grade without relying on predefined weights.
  • Experiments in sports rehabilitation show strong retrieval, faithfulness, and semantic metrics, including 0.812 evidence recall@10, 0.819 answer faithfulness, and 0.788 PICOT match accuracy.
  • The authors release a sports-rehabilitation knowledge graph (357,844 nodes, 371,226 edges) and a benchmark of 1,637 QA pairs; five expert clinicians rated the system 4.66 to 4.84 on a 5-point Likert scale, and system rankings held on a human-verified gold subset (n=80).
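
For readers unfamiliar with the retrieval metric quoted above, recall@k is commonly computed as the share of gold evidence items that appear among the top k retrieved results. The sketch below follows that common definition; the paper's exact "evidence recall@10" protocol may differ, and the function name and document IDs are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of gold evidence items found in the top-k retrieved results.

    Assumption: this is the textbook definition of recall@k, not necessarily
    the exact evaluation protocol used in the paper.
    """
    if not relevant_ids:
        return 0.0  # no gold evidence: define recall as 0 to avoid dividing by zero
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical ranking and gold set, for illustration only.
ranked = ["d3", "d7", "d1", "d9", "d4"]
gold = {"d1", "d4", "d8"}

recall_top3 = recall_at_k(ranked, gold, k=3)  # only d1 is in the top 3
recall_top5 = recall_at_k(ranked, gold, k=5)  # d1 and d4 are in the top 5
```

A benchmark-level score would then be the mean of this value over all queries.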

Abstract

Current medical retrieval-augmented generation (RAG) approaches overlook evidence-based medicine (EBM) principles, leaving two key gaps: (1) no PICO alignment between queries and retrieved evidence, and (2) no consideration of the evidence hierarchy during reranking. We present SR-RAG, an EBM-adapted GraphRAG framework that integrates the PICO framework into knowledge graph construction and retrieval, and we propose Bayesian Evidence Tier Reranking (BETR), which calibrates ranking scores by evidence grade without predefined weights. We validate the approach in sports rehabilitation and release a knowledge graph (357,844 nodes, 371,226 edges) and a benchmark of 1,637 QA pairs. SR-RAG achieves 0.812 evidence recall@10, 0.830 nugget coverage, 0.819 answer faithfulness, 0.882 semantic similarity, and 0.788 PICOT match accuracy, substantially outperforming five baselines. Five expert clinicians rated the system 4.66 to 4.84 on a 5-point Likert scale, and system rankings are preserved on a human-verified gold subset (n=80).
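
The abstract does not spell out how BETR works, so the following is only a rough illustration of the general idea of calibrating ranking scores by evidence grade without hand-set weights. It is not the authors' algorithm. It assumes a Beta-Bernoulli model in which each evidence tier's relevance rate is estimated from labelled examples and then used to scale the retriever score; the class, function names, tier labels, and data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Candidate:
    doc_id: str
    score: float  # retriever similarity, assumed to lie in [0, 1]
    tier: str     # evidence grade label (hypothetical labels below)


def tier_posteriors(
    labelled: Iterable[Tuple[str, bool]], alpha: float = 1.0, beta: float = 1.0
) -> Dict[str, float]:
    """Posterior mean relevance per tier under a Beta(alpha, beta) prior.

    The tier weights are learned from (tier, was_relevant) observations
    rather than fixed in advance.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for tier, relevant in labelled:
        totals[tier] += 1
        hits[tier] += int(relevant)
    return {t: (hits[t] + alpha) / (totals[t] + alpha + beta) for t in totals}


def rerank(
    cands: List[Candidate], posteriors: Dict[str, float], default: float = 0.5
) -> List[Candidate]:
    """Sort candidates by retriever score scaled by their tier's posterior mean."""
    return sorted(
        cands,
        key=lambda c: c.score * posteriors.get(c.tier, default),
        reverse=True,
    )


# Hypothetical relevance feedback: RCTs were relevant 2 of 3 times, case reports 1 of 3.
feedback = [
    ("RCT", True), ("RCT", True), ("RCT", False),
    ("case_report", True), ("case_report", False), ("case_report", False),
]
post = tier_posteriors(feedback)

cands = [
    Candidate("a", 0.80, "case_report"),
    Candidate("b", 0.70, "RCT"),
]
reranked = rerank(cands, post)
```

In this toy example the RCT candidate overtakes the case report despite a lower raw similarity, because its tier has the higher estimated relevance rate. A multiplicative combination is only one possible design; the paper may combine evidence grade and retrieval score differently.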