Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
arXiv cs.CL / 3/18/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- Omanic introduces an open-domain multi-hop QA resource with decomposed sub-questions and intermediate answers to enable step-wise analysis of reasoning.
- The dataset comprises OmanicSynth (10,296 machine-generated training examples) and OmanicBench (967 expert-reviewed evaluation examples) designed to diagnose reasoning processes.
- State-of-the-art LLMs reach only 73.11% multiple-choice accuracy on OmanicBench, underscoring the task's difficulty and the value of step-level annotations for diagnosing where reasoning goes wrong.
- Supervised fine-tuning on OmanicSynth yields substantial transfer gains across six reasoning and math benchmarks, validating the dataset's usefulness for reasoning-capability transfer.
- The data and code are released publicly at HuggingFace and GitHub (https://huggingface.co/datasets/li-lab/Omanic, https://github.com/XiaojieGu/Omanic).
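Because each OmanicBench example ships with decomposed sub-questions and gold intermediate answers, evaluation can score every hop rather than only the final answer. The sketch below illustrates this idea; the field names (`sub_questions`, `answer`) and the normalization rule are illustrative assumptions, not the actual Omanic schema or scoring protocol.

```python
# Hypothetical sketch of step-wise evaluation for decomposed multi-hop QA.
# Field names and matching rules are assumptions, not the Omanic schema.

def normalize(ans: str) -> str:
    """Lowercase and strip surrounding punctuation for lenient matching."""
    return " ".join(ans.lower().strip(" .?!\"'").split())

def stepwise_eval(example: dict, predictions: list[str]) -> dict:
    """Compare each predicted intermediate answer to its gold answer."""
    golds = [step["answer"] for step in example["sub_questions"]]
    hop_correct = [normalize(p) == normalize(g)
                   for p, g in zip(predictions, golds)]
    return {
        "per_hop": hop_correct,
        "all_hops_correct": all(hop_correct),
        # Index of the first wrong hop, or None if every hop is correct.
        "first_error_hop": next(
            (i for i, ok in enumerate(hop_correct) if not ok), None),
    }

# Toy two-hop example (invented for illustration).
example = {
    "question": "On which continent is the capital of France located?",
    "sub_questions": [
        {"question": "What is the capital of France?", "answer": "Paris"},
        {"question": "On which continent is Paris?", "answer": "Europe"},
    ],
}
result = stepwise_eval(example, ["paris", "Asia"])
```

A harness like this pinpoints the first hop at which a model's chain breaks, which is exactly the kind of diagnosis final-answer accuracy alone cannot provide.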