Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale
arXiv cs.CL / 3/13/2026
Key Points
- The study evaluates five model sizes from 360M to 8B across SmolLM2, Qwen2.5, and Llama 3.1, under four retrieval conditions to assess how effectively smaller models utilize retrieved information.
- Models at or below 7B parameters fail to extract the correct answer even from oracle (gold) retrieval 85–100% of the time on questions they cannot answer without external knowledge, revealing a fundamental utilization bottleneck.
- Introducing retrieval context destroys 42–100% of answers the model could previously produce correctly, indicating a distraction effect driven by the mere presence of context rather than its quality.
- An analysis of 2,588 oracle failures shows the dominant error mode is irrelevant generation, where the model ignores the provided context, a finding consistent across prompts and retrieval methods.
- The authors conclude that for sub-7B models, context utilization is the main limitation of RAG, and deploying RAG at this scale can yield a net negative trade-off under standard evaluation conditions.
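The "destroyed answers" finding above amounts to a transition analysis between a closed-book run and a with-context run over the same questions. A minimal sketch of that bookkeeping follows; the function names and the exact-match normalization are illustrative assumptions, not the paper's actual evaluation code.

```python
def exact_match(pred: str, gold: str) -> bool:
    """Lowercase, whitespace-normalized exact match (a common QA metric)."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return norm(pred) == norm(gold)

def transition_stats(closed_book, with_context, gold):
    """Classify each question by how adding retrieval context changes correctness.

    'destroyed' counts answers that were correct closed-book but wrong
    once context was added -- the distraction effect described above.
    """
    stats = {"kept": 0, "destroyed": 0, "gained": 0, "still_wrong": 0}
    for cb_pred, ctx_pred, g in zip(closed_book, with_context, gold):
        before, after = exact_match(cb_pred, g), exact_match(ctx_pred, g)
        if before and after:
            stats["kept"] += 1
        elif before and not after:
            stats["destroyed"] += 1
        elif not before and after:
            stats["gained"] += 1
        else:
            stats["still_wrong"] += 1
    return stats
```

The "gained" bucket is what RAG is supposed to buy; the paper's claim is that for sub-7B models "destroyed" can outweigh it, making the trade-off net negative.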