HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

arXiv cs.CL / 4/23/2026

Key Points

The paper proposes HaS, a “homology-aware speculative retrieval” framework to speed up Retrieval-Augmented Generation (RAG) by reducing time spent on large-scale document retrieval.
Rather than running a slow full-database retrieval for every query, HaS first fetches candidate documents via low-latency speculative retrieval over restricted scopes, then validates whether those candidates contain the required knowledge using the “homology” relation between queries.
Validation is cast as a homologous query re-identification task: when a previously observed query is identified as a homologous re-encounter of the incoming query, the system accepts the draft and bypasses the full-database retrieval.
Experiments show HaS lowers retrieval latency by 23.74% and 36.99% across datasets with only a 1–2% accuracy drop, and as a plug-and-play method it also accelerates complex multi-hop queries in agentic RAG pipelines.
The authors provide the source code on GitHub to support adoption and further experimentation.
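
The control flow described in the key points above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class `SpeculativeRetriever`, the toy bag-of-words `embed` function, the cosine-similarity threshold used to decide whether two queries are "homologous", and the choice of previously retrieved documents as the restricted scope are all assumptions made for the example. The source does not specify how HaS performs re-identification.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, docs, k):
    """Brute-force ranking of docs by similarity to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


class SpeculativeRetriever:
    """Hypothetical sketch: draft from a small scope, validate, else fall back."""

    def __init__(self, corpus, k=1, threshold=0.8):
        self.corpus = corpus        # the full (slow) knowledge base
        self.k = k
        self.threshold = threshold  # assumed similarity cut-off for "homologous"
        self.history = []           # (query, docs) pairs from past full retrievals
        self.full_calls = 0         # counts expensive full-database retrievals

    def full_retrieve(self, query):
        self.full_calls += 1
        return top_k(query, self.corpus, self.k)

    def retrieve(self, query):
        # 1. Speculative retrieval over a restricted scope: here, only the
        #    documents returned for earlier queries (deduplicated, order kept).
        scope = list(dict.fromkeys(d for _, docs in self.history for d in docs))
        draft = top_k(query, scope, self.k)

        # 2. Validation: look for a previously seen query that appears to be
        #    homologous to the incoming one (approximated by cosine similarity).
        q = embed(query)
        best = max((cosine(q, embed(past)) for past, _ in self.history),
                   default=0.0)
        if draft and best >= self.threshold:
            return draft, "draft accepted"

        # 3. Fallback: run the full retrieval and remember the result.
        docs = self.full_retrieve(query)
        self.history.append((query, docs))
        return docs, "full retrieval"
```

With a three-document corpus, a first query such as "what is the capital of France" falls back to full retrieval, while a rephrased follow-up such as "what is the capital city of France" is accepted as a draft without touching the full corpus. An unrelated query still triggers the fallback, which is what keeps the accuracy loss bounded in this sketch.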

Abstract

Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
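
A back-of-the-envelope cost model, which is not taken from the paper, may help show why the prevalence of homologous queries matters. Assume every query pays a speculative retrieval cost $t_s$ and a validation cost $t_v$, a fraction $p$ of drafts is accepted, and each rejected draft falls back to a full-database retrieval costing $t_f$. All four symbols are illustrative assumptions.

```latex
\begin{align}
\mathbb{E}[T_{\text{HaS}}] &= t_s + t_v + (1 - p)\, t_f \\
\text{relative saving} &= 1 - \frac{\mathbb{E}[T_{\text{HaS}}]}{t_f}
                        = p - \frac{t_s + t_v}{t_f}
\end{align}
```

Under these assumptions the saving is positive only when the acceptance rate $p$ exceeds the overhead ratio $(t_s + t_v)/t_f$, which is consistent with the abstract's point that the gains rest on homologous queries being common in real-world traffic.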