BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

arXiv cs.CL, April 30, 2026


Key Points

  • The paper introduces a scalable framework for generating complex question-answering (QA) datasets using “graphlet-anchored” prompts derived from small subgraphs of a knowledge graph.
  • The first implementation, BioGraphletQA, provides 119,856 biomedical KGQA pairs grounded in graphlets (up to five nodes) from the OREGANO KG, often augmented with relevant PubMed document snippets.
  • A domain-expert evaluation on 106 QA pairs indicates the generated questions have high scientific validity and appropriate complexity.
  • Adding BioGraphletQA to downstream benchmarks improves accuracy: PubMedQA rises from 49.2% to 68.5% in a low-resource setup, and MedQA improves from 41.4% to 44.8% in a full-resource setup.
  • The dataset and framework code are released publicly to support reproducibility, reuse, and extension for tasks such as MCQA and KGQA.

Abstract

This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. At the core of this framework is a graphlet-anchored generation process, where small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, and most pairs are enriched with relevant document snippets from PubMed. First, we demonstrate the framework's value and the dataset's quality through an evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Second, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility, and extension.
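To make the graphlet-anchored idea concrete, here is a minimal, hypothetical sketch of the two steps the abstract describes: sampling a small connected subgraph (up to five nodes) from a KG edge list, then serializing it into a structured prompt for an LLM. The triples, function names, and prompt wording are illustrative assumptions, not the paper's actual implementation or the OREGANO schema.

```python
import random

# Toy stand-in for a slice of a biomedical KG (illustrative triples only;
# not taken from OREGANO or BioGraphletQA).
TRIPLES = [
    ("aspirin", "treats", "inflammation"),
    ("aspirin", "inhibits", "COX-1"),
    ("COX-1", "involved_in", "prostaglandin synthesis"),
    ("inflammation", "symptom_of", "arthritis"),
    ("ibuprofen", "treats", "arthritis"),
]

def sample_graphlet(triples, max_nodes=5, seed=0):
    """Greedily grow a connected subgraph (graphlet) of up to max_nodes nodes."""
    rng = random.Random(seed)
    start = rng.choice(triples)
    nodes = {start[0], start[2]}
    graphlet = [start]
    changed = True
    while changed and len(nodes) < max_nodes:
        changed = False
        for t in triples:
            if t in graphlet:
                continue
            # Add an edge only if it extends the frontier by exactly one node,
            # which keeps the sampled graphlet connected.
            if (t[0] in nodes) != (t[2] in nodes):
                graphlet.append(t)
                nodes.update((t[0], t[2]))
                changed = True
                if len(nodes) >= max_nodes:
                    break
    return graphlet

def build_prompt(graphlet, snippet=None):
    """Serialize the graphlet (plus an optional document snippet, e.g. from
    PubMed) into a structured generation prompt for an LLM."""
    facts = "\n".join(f"- {h} {r} {o}" for h, r, o in graphlet)
    prompt = (
        "Using ONLY the facts below, write one multi-hop question "
        "and its answer.\nFacts:\n" + facts
    )
    if snippet:
        prompt += "\nContext:\n" + snippet
    return prompt
```

Anchoring the prompt to a bounded subgraph is what lets the framework control question complexity (hop count roughly tracks graphlet size) while keeping generation factually grounded in the KG.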