Automatic Inter-document Multi-hop Scientific QA Generation

arXiv cs.CL / 3/17/2026


Key Points

AIM-SciQA is a new automated framework for generating inter-document, multi-hop scientific QA datasets.
It extracts single-hop QAs using large language models with machine reading comprehension, then builds cross-document relations through embedding-based semantic alignment while selectively using citation information.
Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop QAs and 13,672 multi-hop QAs, which form the IM-SciQA dataset. A citation-guided variant, CIM-SciQA, achieves performance comparable to the Oracle setting.
Human and automatic validation confirmed high factual consistency, and experiments show that the dataset differentiates reasoning capabilities across the retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning.
The authors present the citation-guided extension of the framework as reinforcing the dataset's validity and generality.

Abstract

Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset's validity and generality.
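
To make the idea of embedding-based semantic alignment with selective citation use more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `SingleHopQA` record, the `align_across_documents` function, the 0.8 threshold, the optional citation filter, and the three-dimensional toy vectors are all assumptions made for illustration. A real pipeline would obtain embeddings from a sentence-embedding model and would likely use a different pairing rule.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class SingleHopQA:
    """Hypothetical record for one single-hop QA extracted from one paper."""
    doc_id: str
    question: str
    answer: str
    embedding: List[float]  # toy vector; a real system would use a model


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_across_documents(
    qas: List[SingleHopQA],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
    citations: Optional[Set[Tuple[str, str]]] = None,
) -> List[Tuple[SingleHopQA, SingleHopQA, float]]:
    """Pair QAs from different documents whose embeddings are similar.

    If `citations` is given, only document pairs linked by a citation
    (in either direction) are kept. This is an assumed reading of
    "selectively leveraging citation information".
    """
    pairs = []
    for i, first in enumerate(qas):
        for second in qas[i + 1:]:
            if first.doc_id == second.doc_id:
                continue  # inter-document pairs only
            if citations is not None:
                linked = ((first.doc_id, second.doc_id) in citations
                          or (second.doc_id, first.doc_id) in citations)
                if not linked:
                    continue
            score = cosine(first.embedding, second.embedding)
            if score >= threshold:
                pairs.append((first, second, score))
    # Highest-similarity candidates first.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


# Toy data: made-up documents, questions, and vectors.
toy_qas = [
    SingleHopQA("doc_A", "Which protein does compound X inhibit?",
                "protein P", [1.0, 0.0, 0.0]),
    SingleHopQA("doc_B", "Which pathway is regulated by protein P?",
                "pathway Q", [0.9, 0.1, 0.0]),
    SingleHopQA("doc_C", "What sample size was used?",
                "120 patients", [0.0, 1.0, 0.0]),
]

if __name__ == "__main__":
    for first, second, score in align_across_documents(toy_qas):
        print(first.doc_id, second.doc_id, round(score, 3))
```

In this toy setup only the `doc_A` and `doc_B` QAs clear the threshold, so they form the single candidate pair from which a two-hop question could be composed. Passing a `citations` set restricts candidates to document pairs with a citation link, loosely mirroring the citation-guided variant described above.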


