AI Navigate

Test-Time Strategies for More Efficient and Accurate Agentic RAG

arXiv cs.AI / March 16, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper investigates test-time modifications to the Search-R1 Retrieval-Augmented Generation pipeline to reduce inefficiencies such as repeated retrieval and poor contextualization.
  • It proposes two components, evaluated separately and in combination: a contextualization module that fuses retrieved documents into the model's reasoning more effectively, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones (a minimal sketch of both follows this list).
  • The evaluation uses HotpotQA and Natural Questions, reporting exact match (EM) scores, an LLM-as-a-Judge assessment of answer correctness, and the average number of retrieval turns.
  • The best-performing variant uses GPT-4.1-mini for contextualization and achieves a 5.6% increase in EM and a 10.5% reduction in turns versus the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
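To make the two modules concrete, here is a minimal Python sketch of how they might slot into an iterative retrieve-and-reason loop. All names and interfaces (`retriever`, `llm`, `contextualize`, `deduplicate`, the `ANSWER:` stop convention) are illustrative assumptions, not the paper's actual Search-R1 implementation.

```python
# Hypothetical sketch of the two test-time modules; names and interfaces
# are assumptions, not the paper's Search-R1 code.

def deduplicate(ranked_docs, seen_ids, k):
    """De-duplication: skip documents retrieved in earlier turns and
    back-fill with the next most relevant ones from the ranking."""
    fresh = [d for d in ranked_docs if d["id"] not in seen_ids]
    return fresh[:k]

def contextualize(llm, question, docs):
    """Contextualization: have a helper LLM (GPT-4.1-mini in the paper's
    best variant) distill the retrieved documents into a short passage
    focused on the current question."""
    doc_text = "\n\n".join(d["text"] for d in docs)
    prompt = (f"Question: {question}\n\nDocuments:\n{doc_text}\n\n"
              "Summarize only the information relevant to the question.")
    return llm(prompt)

def agentic_rag(llm, retriever, question, k=3, max_turns=8):
    """One possible test-time loop combining both modules."""
    seen_ids, context = set(), ""
    for turn in range(1, max_turns + 1):
        step = llm(f"Context so far:\n{context}\n\nQuestion: {question}\n"
                   "Issue a search query, or reply 'ANSWER: ...' if done.")
        if step.startswith("ANSWER:"):            # model decides to stop
            return step.removeprefix("ANSWER:").strip(), turn
        ranked = retriever(step)                  # full relevance ranking
        docs = deduplicate(ranked, seen_ids, k)   # replace repeats
        seen_ids.update(d["id"] for d in docs)
        context += "\n" + contextualize(llm, question, docs)
    # Fallback: answer from the accumulated context after the turn budget.
    return llm(f"Context:\n{context}\n\nAnswer: {question}"), max_turns
```

The design point worth noting is that de-duplication operates on the retriever's full ranking, so repeated hits are back-filled with fresh documents rather than simply dropped, while contextualization compresses each turn's documents before they enter the prompt; together this is how the modules can reduce both turns and token consumption.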

Abstract

Retrieval-Augmented Generation (RAG) systems face challenges with complex, multi-hop questions, and iterative agentic frameworks such as Search-R1 (Jin et al., 2025) have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repetitive retrieval of previously processed information and difficulty contextualizing retrieved results effectively within the current generation prompt. These issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption. In this paper, we investigate test-time modifications to the Search-R1 pipeline that mitigate these shortcomings. Specifically, we explore the integration of two components, individually and in combination: a contextualization module that better integrates relevant information from retrieved documents into reasoning, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches on the HotpotQA (Yang et al., 2018) and Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of retrieval turns. Our best-performing variant, which uses GPT-4.1-mini for contextualization, achieves a 5.6% increase in EM score and reduces the number of turns by 10.5% compared to the Search-R1 baseline, demonstrating improved answer accuracy and retrieval efficiency.
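For reference, EM is conventionally computed with SQuAD-style answer normalization, and retrieval efficiency here is simply the mean number of turns. Below is a minimal sketch of these two metrics; the `evaluate` helper and its result schema are assumptions, not the paper's evaluation code.

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, drop articles and
    punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """EM is 1.0 if the normalized prediction equals any gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

def evaluate(results):
    """Aggregate mean EM and mean retrieval turns over a dataset.
    Each result is assumed to look like:
    {"pred": str, "golds": [str, ...], "turns": int}."""
    em = sum(exact_match(r["pred"], r["golds"]) for r in results) / len(results)
    avg_turns = sum(r["turns"] for r in results) / len(results)
    return {"EM": em, "avg_turns": avg_turns}
```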