From Relevance to Authority: Authority-aware Generative Retrieval in Web Search Engines

arXiv cs.CL / 4/16/2026

Key Points

The paper argues that generative information retrieval (GenIR) in web search should optimize not only for relevance but also for document trustworthiness, especially in high-stakes domains such as healthcare and finance.
It proposes the Authority-aware Generative Retriever (AuthGR), which the authors describe as the first framework to explicitly incorporate authority into GenIR; it uses a vision-language model for multimodal authority scoring.
AuthGR uses a three-stage training pipeline to progressively teach the retriever authority awareness, followed by a hybrid ensemble deployment approach.
Offline experiments show improved authority and retrieval accuracy, including a 3B model that matches the performance of a 14B baseline.
Large-scale online A/B tests and human evaluations on a commercial web search platform indicate significant gains in real-world user engagement and perceived reliability.
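
To make the first key point concrete, here is a minimal sketch of how ranking by relevance alone can differ from ranking that also weighs authority. It is not the paper's method: the `Candidate` type, the example URLs, the scores, and the linear blend controlled by `alpha` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    url: str          # hypothetical document identifier
    relevance: float  # assumed query-document relevance in [0, 1]
    authority: float  # assumed source trustworthiness in [0, 1]


def authority_aware_rank(candidates, alpha=0.5):
    """Sort candidates by a convex mix of relevance and authority.

    alpha = 0 reproduces relevance-only ranking; a larger alpha
    gives authority more weight. The linear blend is an assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")

    def blended(c):
        return (1.0 - alpha) * c.relevance + alpha * c.authority

    return sorted(candidates, key=blended, reverse=True)


# Made-up candidates for a health-related query.
candidates = [
    Candidate("forum.example/thread", relevance=0.92, authority=0.20),
    Candidate("hospital.example/guide", relevance=0.85, authority=0.95),
    Candidate("blog.example/post", relevance=0.60, authority=0.50),
]

relevance_only = [c.url for c in authority_aware_rank(candidates, alpha=0.0)]
authority_aware = [c.url for c in authority_aware_rank(candidates, alpha=0.5)]
```

With these made-up scores, the forum thread ranks first on relevance alone, while the blended ranking moves the hospital guide to the top.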

Abstract

Generative information retrieval (GenIR) formulates the retrieval process as a text-to-text generation task, leveraging the vast knowledge of large language models. However, existing work primarily optimizes for relevance while often overlooking document trustworthiness. This is critical in high-stakes domains like healthcare and finance, where relying solely on semantic relevance risks retrieving unreliable information. To address this, we propose an Authority-aware Generative Retriever (AuthGR), the first framework that incorporates authority into GenIR. AuthGR consists of three key components: (i) Multimodal Authority Scoring, which employs a vision-language model to quantify authority from textual and visual cues; (ii) a Three-stage Training Pipeline to progressively instill authority awareness into the retriever; and (iii) a Hybrid Ensemble Pipeline for robust deployment. Offline evaluations demonstrate that AuthGR successfully enhances both authority and accuracy, with our 3B model matching a 14B baseline. Crucially, large-scale online A/B tests and human evaluations conducted on a commercial web search platform confirm significant improvements in real-world user engagement and reliability.
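
The abstract does not say how the Hybrid Ensemble Pipeline combines its members. As a rough illustration of what merging a generative retriever's output with an existing retriever's output can look like, the sketch below uses reciprocal rank fusion, a common general-purpose fusion technique swapped in here for illustration only. The document IDs, the two example runs, and the constant `k=60` are assumptions, not details from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document receives sum(1 / (k + rank)) over the lists that
    contain it, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document ID
    # so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs from two retrievers for the same query.
generative_run = ["doc_a", "doc_b", "doc_c"]  # e.g. a generative retriever
baseline_run = ["doc_b", "doc_d", "doc_a"]    # e.g. an existing retriever

fused = reciprocal_rank_fusion([generative_run, baseline_run])
```

Documents that both retrievers rank highly (`doc_b`, `doc_a`) rise above those returned by only one of them, which is one simple way an ensemble can hedge against the failure modes of a single retriever.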