Llama 8B matching 70B on multi-hop QA with structured prompting, no fine-tuning

Reddit r/LocalLLaMA / 3/22/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • Retrieval for multi-hop QA is essentially solved, with the answer found in the context 77 to 91% of the time.
  • The bottleneck is reasoning, with 73 to 84% of wrong answers due to the model failing to connect the dots rather than missing information.
  • Two inference-time tricks close the gap: structured chain of thought that decomposes questions into graph query patterns before answering, and compressing the retrieved context by about 60% via graph traversal (no extra LLM calls).
  • Llama 3.1 8B with these augmentations matches or exceeds vanilla Llama 3.3 70B on three benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) at roughly 12x lower cost (on Groq); also confirmed to work with LightRAG.

Ran a bunch of experiments with Graph RAG (KET-RAG) on multi-hop question answering. Turns out retrieval is basically solved: the answer is in the context 77 to 91% of the time. The bottleneck is reasoning: 73 to 84% of wrong answers come from the model failing to connect the dots, not from missing information.

Smaller models choke on the reasoning even when the answer is sitting right there in the context.

Found that two inference-time tricks close the gap:

  • Structured chain of thought that decomposes questions into graph query patterns before answering
  • Compressing the retrieved context by ~60% through graph traversal (no extra LLM calls)
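For anyone wanting to try this, here's a minimal sketch of what the two tricks could look like. This is my own illustration, not the paper's code: the prompt template, function names, and triple format are all assumptions. Trick 1 is just a prompt that forces the model to emit `(entity, relation, ?var)` query patterns before answering; trick 2 is a plain BFS over the entity graph that keeps only triples within a few hops of the question's seed entities, so no extra LLM calls are needed.

```python
from collections import deque

# Trick 1: structured chain of thought. The template below is a
# hypothetical reconstruction — the idea is to make the model
# decompose the question into graph query patterns before answering.
COT_TEMPLATE = """Question: {question}

Step 1. Decompose the question into graph query patterns of the form
(entity, relation, ?x), chaining variables across hops.
Step 2. Resolve each pattern against the context below.
Step 3. State the final answer on its own line.

Context:
{context}
"""

def build_prompt(question: str, context: str) -> str:
    return COT_TEMPLATE.format(question=question, context=context)

# Trick 2: compress retrieved context by graph traversal.
# Keep only triples reachable within max_hops of the seed entities;
# everything else is dropped before it ever reaches the model.
def compress(triples, seeds, max_hops=2):
    """triples: list of (head, relation, tail); seeds: question entities."""
    adj = {}  # entity -> indices of incident triples
    for i, (h, _, t) in enumerate(triples):
        adj.setdefault(h, []).append(i)
        adj.setdefault(t, []).append(i)
    kept, seen = set(), set(seeds)
    frontier = deque((e, 0) for e in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop budget
        for i in adj.get(node, []):
            kept.add(i)
            h, _, t = triples[i]
            nbr = t if node == h else h
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return [triples[i] for i in sorted(kept)]
```

With `max_hops=2` and one seed entity, a chain A→B→C survives while a disconnected X→Y edge is pruned; on real retrieved subgraphs that kind of pruning is where the ~60% reduction would come from, though the exact traversal policy in the paper may differ.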

End result: Llama 3.1 8B with these augmentations matches or exceeds vanilla Llama 3.3 70B on three common benchmarks at roughly 12x lower cost (on Groq). Tested on HotpotQA, MuSiQue, and 2WikiMultiHopQA (500 questions each).

Also confirmed it works with LightRAG, not just the one system.

arxiv: https://arxiv.org/abs/2603.14045

submitted by /u/Greedy-Teach1533