Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

arXiv cs.CL / 5/5/2026

Key Points

  • The paper argues that standard RAG, which injects raw retrieved text into the LLM’s context, often integrates that text poorly, degrading grounded reasoning and answer quality.
  • It introduces “Verbal Annotations,” analytic narratives that explicitly connect a search query to retrieved contexts, and shows empirically that these annotations improve the LLM’s accuracy and contextual grounding.
  • Building on this idea, it proposes Verbal-R3, an agentic RAG framework with a Generator and a Verbal Reranker that provides both relevance scores and Verbal Annotations to steer the Generator’s reasoning.
  • Inference is further refined with relevance-guided test-time scaling, which allocates test-time compute efficiently when expanding reasoning trajectories.
  • Verbal-R3 reportedly reaches state-of-the-art results on complex question answering benchmarks, supporting the framework’s effectiveness.
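
The reranker interface described in the points above (a relevance score plus a Verbal Annotation for each retrieved passage, both passed to the Generator) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `RerankedPassage`, `verbal_rerank`, and `build_generator_context` are hypothetical, and the lexical-overlap scoring and templated annotations are toy stand-ins, not the trained reranker from the paper.

```python
from dataclasses import dataclass


@dataclass
class RerankedPassage:
    text: str
    score: float      # relevance score in [0, 1]
    annotation: str   # short narrative linking the query to the passage


def _tokens(s: str) -> set[str]:
    # Lowercase words with surrounding punctuation stripped.
    return {w.strip(".,?!").lower() for w in s.split() if w.strip(".,?!")}


def verbal_rerank(query: str, passages: list[str]) -> list[RerankedPassage]:
    """Toy stand-in for a Verbal Reranker (assumption: lexical overlap,
    not a learned model). Returns passages sorted by descending score."""
    q = _tokens(query)
    results = []
    for p in passages:
        shared = sorted(q & _tokens(p))
        score = len(shared) / len(q) if q else 0.0
        if shared:
            note = f"Shares the terms {', '.join(shared)} with the query."
        else:
            note = "Shares no terms with the query; likely off-topic."
        results.append(RerankedPassage(p, score, note))
    return sorted(results, key=lambda r: r.score, reverse=True)


def build_generator_context(query: str, ranked: list[RerankedPassage],
                            top_k: int = 2) -> str:
    """Place each passage next to its score and annotation, instead of
    injecting raw retrieved text alone."""
    lines = [f"Question: {query}"]
    for i, r in enumerate(ranked[:top_k], 1):
        lines.append(f"[{i}] (relevance {r.score:.2f}) {r.text}")
        lines.append(f"    Annotation: {r.annotation}")
    return "\n".join(lines)


ranked = verbal_rerank(
    "Who wrote Hamlet?",
    ["Paris is the capital of France.",
     "Hamlet was written by William Shakespeare."],
)
context = build_generator_context("Who wrote Hamlet?", ranked)
```

In a real system the string produced by `build_generator_context` would be handed to the Generator LLM at each retrieval step; the point of the sketch is only the shape of the data that crosses the reranker-to-generator boundary.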

Abstract

The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the context of a Large Language Model (LLM) often results in suboptimal integration of retrieved information. This paper proposes to bridge retrieval results and the LLM's reasoning ability through Verbal Annotations, analytic narratives that explicitly articulate the logical connection between a search query and retrieved contexts. Our empirical investigation reveals the potential of Verbal Annotations to substantially enhance the LLM's ability to generate accurate, contextually grounded responses. Motivated by this finding, we introduce Verbal-R3, a novel agentic RAG framework that consists of a Generator and a Verbal Reranker. The Generator performs iterative retrieval and reasoning, while the Verbal Reranker returns relevance scores and Verbal Annotations to guide the reasoning and answering process of the Generator. The inference process of Verbal-R3 is further refined through relevance-guided test-time scaling, which efficiently allocates test-time compute for effective trajectory expansion. Verbal-R3 achieves state-of-the-art performance on complex Question Answering benchmarks, validating the effectiveness of the proposed framework.
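
The abstract does not spell out how relevance-guided test-time scaling distributes compute, so the following is only one plausible reading, sketched under stated assumptions: each candidate reasoning trajectory carries a non-negative relevance score from the reranker, and a fixed budget of expansions is split across trajectories in proportion to those scores. The function name `allocate_expansions` and the proportional rule itself are hypothetical, not taken from the paper.

```python
def allocate_expansions(relevance: list[float], budget: int) -> list[int]:
    """Split `budget` expansions across trajectories in proportion to their
    relevance scores (assumed non-negative), using largest-remainder rounding
    so the counts always sum to `budget`."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if not relevance:
        return []
    total = sum(relevance)
    if total <= 0:
        # No relevance signal: fall back to an even split.
        base, extra = divmod(budget, len(relevance))
        return [base + (1 if i < extra else 0) for i in range(len(relevance))]
    shares = [budget * r / total for r in relevance]
    counts = [int(s) for s in shares]  # floor of each ideal share
    leftover = budget - sum(counts)
    # Hand remaining expansions to the largest fractional remainders.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Three candidate trajectories; the first is judged twice as relevant.
plan = allocate_expansions([2.0, 1.0, 1.0], budget=8)
```

Under this rule, trajectories backed by highly relevant retrievals receive more samples, while weakly supported ones are expanded less or not at all. Other policies, such as pruning below a score threshold, would fit the abstract's description equally well.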