CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation
arXiv cs.CL / 4/1/2026
📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research
Key Points
- CrossTrace is introduced as a cross-domain dataset containing 1,389 grounded scientific reasoning traces for hypothesis generation, covering biomedical research, AI/ML, and cross-domain work.
- Each reasoning trace follows an Input/Trace/Output schema with step-level grounding in source paper text, extending the Bit-Flip-Spark framework used by HypoGen.
- The dataset defines eight discovery patterns and includes multi-domain coverage to support evaluation and training of hypothesis-generation models across disciplines.
- Fine-tuning Qwen2.5-7B-Instruct on CrossTrace with QLoRA shows large gains in judging scores, structural compliance, and similarity metrics versus an untuned baseline, with additional improvements from balanced cross-domain training.
- Human validation on 150 sampled records reports 99.7% step-level grounding accuracy and a 0.0% fabrication rate, supporting the dataset’s claim of domain-general training value.
Related Articles

Black Hat Asia
AI Business

Knowledge Governance For The Agentic Economy.
Dev.to

AI server farms heat up the neighborhood for miles around, paper finds
The Register

Paperclip: Công Cụ Miễn Phí Biến AI Thành Đội Phát Triển Phần Mềm
Dev.to
Does the Claude “leak” actually change anything in practice?
Reddit r/LocalLLaMA