Llama 8B matching 70B on multi-hop QA with structured prompting, no fine-tuning

Reddit r/LocalLLaMA / 3/22/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • Retrieval for multi-hop QA is essentially solved, with the answer found in the context 77 to 91% of the time.
  • The bottleneck is reasoning, with 73 to 84% of wrong answers due to the model failing to connect the dots rather than missing information.
  • Two inference-time tricks close the gap: structured chain of thought that decomposes questions into graph query patterns before answering, and compressing the retrieved context by about 60% via graph traversal (no extra LLM calls).
  • Llama 3.1 8B with these augmentations matches or exceeds vanilla Llama 3.3 70B on three benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) at roughly 12x lower cost (on Groq); also confirmed to work with LightRAG.

Ran a bunch of experiments with Graph RAG (KET-RAG) on multi-hop question answering. Turns out retrieval is basically solved: the answer is in the context 77 to 91% of the time. The bottleneck is reasoning: 73 to 84% of wrong answers come from the model failing to connect the dots, not from missing information.

Smaller models choke on the reasoning even when the answer is sitting right there in the context.

Found that two inference-time tricks close the gap:

  • Structured chain of thought that decomposes questions into graph query patterns before answering
  • Compressing the retrieved context by ~60% through graph traversal (no extra LLM calls)
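For anyone wanting to try this, here's a minimal sketch of what the two tricks could look like. This is my own illustration, not the paper's code: the prompt template, function names, and triple format are all assumptions. Trick 1 is just a prompt that forces the model to emit `(entity, relation, ?var)` query patterns before answering; trick 2 is a plain BFS over the entity graph that keeps only triples within a few hops of the question's seed entities, so no extra LLM calls are needed.

```python
from collections import deque

# Trick 1: structured chain of thought. The template below is a
# hypothetical reconstruction — the idea is to make the model
# decompose the question into graph query patterns before answering.
COT_TEMPLATE = """Question: {question}

Step 1. Decompose the question into graph query patterns of the form
(entity, relation, ?x), chaining variables across hops.
Step 2. Resolve each pattern against the context below.
Step 3. State the final answer on its own line.

Context:
{context}
"""

def build_prompt(question: str, context: str) -> str:
    return COT_TEMPLATE.format(question=question, context=context)

# Trick 2: compress retrieved context by graph traversal.
# Keep only triples reachable within max_hops of the seed entities;
# everything else is dropped before it ever reaches the model.
def compress(triples, seeds, max_hops=2):
    """triples: list of (head, relation, tail); seeds: question entities."""
    adj = {}  # entity -> indices of incident triples
    for i, (h, _, t) in enumerate(triples):
        adj.setdefault(h, []).append(i)
        adj.setdefault(t, []).append(i)
    kept, seen = set(), set(seeds)
    frontier = deque((e, 0) for e in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop budget
        for i in adj.get(node, []):
            kept.add(i)
            h, _, t = triples[i]
            nbr = t if node == h else h
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return [triples[i] for i in sorted(kept)]
```

With `max_hops=2` and one seed entity, a chain A→B→C survives while a disconnected X→Y edge is pruned; on real retrieved subgraphs that kind of pruning is where the ~60% reduction would come from, though the exact traversal policy in the paper may differ.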

End result: Llama 3.1 8B with these augmentations matches or exceeds vanilla Llama 3.3 70B on three common benchmarks at roughly 12x lower cost (on Groq). Tested on HotpotQA, MuSiQue, and 2WikiMultiHopQA (500 questions each).

Also confirmed it works with LightRAG, not just the one system.

arxiv: https://arxiv.org/abs/2603.14045

submitted by /u/Greedy-Teach1533