Ran a bunch of experiments with Graph RAG (KET-RAG) on multi-hop question answering. Turns out retrieval is largely solved in this setup: the answer is present in the retrieved context 77 to 91% of the time. The bottleneck is reasoning: 73 to 84% of wrong answers come from the model failing to connect the dots, not from missing information.
Smaller models choke on the reasoning even when the answer is sitting right there in the context.
Found that two inference-time tricks close the gap:

- Structured chain-of-thought prompting that decomposes questions into graph query patterns before answering
- Compressing the retrieved context by ~60% through graph traversal (no extra LLM calls)
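To make the first trick concrete, here's a minimal sketch of what a decomposition prompt could look like. The exact wording and pattern syntax are assumptions, not the prompt from these experiments:

```python
def decomposition_prompt(question: str) -> str:
    """Build a structured chain-of-thought prompt that asks the model to
    break a multi-hop question into graph query patterns before answering.
    Illustrative template only; the actual prompt isn't published here."""
    return (
        "Question: " + question + "\n\n"
        "Before answering, decompose the question into graph query patterns:\n"
        "1. List the entities mentioned in the question.\n"
        "2. For each hop, write a pattern (subject) -[relation]-> (?),\n"
        "   using ? for the unknown that hop must resolve.\n"
        "3. Resolve each pattern against the retrieved context, in order,\n"
        "   substituting each answer into the next pattern.\n"
        "4. State the final answer on a line starting with 'Answer:'.\n"
    )
```

The point is to force the model to commit to a hop structure before it starts generating an answer, which is where small models otherwise lose the thread.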
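And a rough sketch of the second trick: compression via graph traversal, with no LLM in the loop. This is my reading of the idea, not the actual implementation; the adjacency-dict graph, the `entities` field on passages, and the hop budget are all assumptions:

```python
from collections import deque

def compress_context(graph, passages, seed_entities, max_hops=2):
    """Keep only passages attached to entities within max_hops of the
    entities mentioned in the question (BFS over the knowledge graph).
    graph: dict mapping entity -> list of neighboring entities
    passages: list of dicts, each with an 'entities' set
    Hypothetical interface; the post doesn't specify the traversal."""
    reachable = set(seed_entities)
    frontier = deque((e, 0) for e in seed_entities)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted for this branch
        for neighbor in graph.get(node, []):
            if neighbor not in reachable:
                reachable.add(neighbor)
                frontier.append((neighbor, depth + 1))
    # Drop any passage that doesn't mention a reachable entity
    return [p for p in passages if p["entities"] & reachable]
```

Because it's a pure graph walk over the already-built index, it adds no inference cost, which is presumably how the ~60% reduction stays free.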
End result: Llama 3.1 8B with these augmentations matches or exceeds vanilla Llama 3.3 70B on three common multi-hop benchmarks at roughly 12x lower cost (on Groq). Tested on HotpotQA, MuSiQue, and 2WikiMultiHopQA (500 questions each).
Also confirmed the gains transfer to LightRAG, so it isn't specific to one Graph RAG system.