SWE-QA: A Dataset and Benchmark for Complex Code Understanding
arXiv cs.AI / 4/29/2026
Key Points
- The paper introduces SWE-QA, a benchmark dataset designed to evaluate multi-hop code comprehension, mirroring the chained information lookups developers perform in real software development.
- SWE-QA contains 9,072 multiple-choice questions generated from 12 Python repositories derived from SWE-bench, focusing on reasoning patterns such as Declaration-and-Call and Interacting-Entity (a hypothetical Declaration-and-Call chain is sketched after this list).
- Dataset construction combines parsing-based entity extraction with LLM-assisted question generation and validated distractors, reducing the risk that models succeed through superficial pattern matching (see the extraction sketch at the end of this note).
- Experiments across 15 language models (360M to 671B parameters) show that multi-hop reasoning remains difficult; the best model reaches only 74.41% accuracy.
- Dense model architectures outperform mixture-of-experts models by 10–14 percentage points, while reasoning-enhanced variants provide inconsistent gains.
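To make the Declaration-and-Call pattern concrete, here is a hypothetical example, not drawn from SWE-QA itself: the module names, functions, and question below are illustrative only. Answering correctly requires chaining from a call site in one "file" back to the declaration in another.

```python
# declaration_and_call_demo.py
# A self-contained, hypothetical illustration of a Declaration-and-Call chain.
# In SWE-QA these hops span real repository files; here both "files" are
# collapsed into one script so the example runs as-is.

# "utils/network.py" -- the declaration hop
def retry_request(url: str, attempts: int = 3) -> str:
    """Pretend to fetch `url`, retrying up to `attempts` times (simulated)."""
    for attempt in range(1, attempts + 1):
        if attempt == attempts:  # simulate success on the final try
            return f"payload from {url}"
    raise RuntimeError("unreachable")

# "services/sync.py" -- the call-site hop
def sync_catalog(endpoint: str) -> str:
    # The retry count (3) is visible only at the declaration above, so a
    # question like "How many attempts does sync_catalog make at most?"
    # forces a two-hop chain from this call back to the default parameter.
    return retry_request(endpoint)

if __name__ == "__main__":
    print(sync_catalog("https://example.com/catalog"))
```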



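The paper's extraction pipeline is not detailed in this summary, but parsing-based entity extraction for Python commonly builds on the standard-library `ast` module. The following is a minimal sketch under that assumption; it collects declared functions and classes along with the names they call, the raw material from which Declaration-and-Call questions could be posed.

```python
import ast

def extract_entities(source: str, filename: str = "<repo file>") -> list[dict]:
    """Minimal sketch of parsing-based entity extraction (assumed approach,
    not necessarily the paper's actual pipeline)."""
    tree = ast.parse(source, filename=filename)
    entities = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Record every simple-name call inside this declaration, yielding
            # the declaration -> call edges multi-hop questions chain over.
            calls = [
                child.func.id
                for child in ast.walk(node)
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name)
            ]
            entities.append({
                "kind": type(node).__name__,
                "name": node.name,
                "line": node.lineno,
                "calls": calls,
            })
    return entities

if __name__ == "__main__":
    demo = "def helper():\n    return 1\n\ndef main():\n    return helper() + 1\n"
    for entity in extract_entities(demo):
        print(entity)
```

A real pipeline would need further handling for attribute calls such as `obj.method()` and for cross-file imports; this sketch covers only direct name calls.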