LLM-Oriented Information Retrieval: A Denoising-First Perspective

arXiv cs.AI / 5/4/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that information retrieval for LLMs is fundamentally different from human-focused IR because LLMs have limited attention and are highly sensitive to noise, which can directly trigger hallucinations and reasoning failures.
  • It proposes that a “denoising-first” approach—maximizing usable evidence density and verifiability within the model’s context window—is becoming the main bottleneck across the entire information access pipeline.
  • The authors introduce a four-stage framework describing how information can move from being inaccessible to undiscoverable, then misaligned, and finally unverifiable in LLM-based workflows.
  • They provide a pipeline-organized taxonomy of signal-to-noise optimization methods across indexing, retrieval, context engineering, verification, and agentic search workflows.
  • The paper reviews research directions in retrieval-heavy applications such as lifelong assistants, coding agents, deep research systems, and multimodal understanding.
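To make the "denoising-first" idea concrete, here is a minimal sketch of one signal-to-noise optimization step from the pipeline: packing only high-relevance retrieved chunks into a fixed context-window budget before prompting the LLM. All names, the scoring scheme, and the greedy policy are illustrative assumptions, not the paper's actual method.

```python
def pack_context(chunks, budget_tokens, min_score=0.5):
    """Greedily select high-scoring chunks until the token budget is spent.

    chunks: list of (text, relevance_score, token_count) tuples.
    Returns the packed context string and the fraction of the budget
    filled with evidence that cleared the relevance threshold.
    """
    kept, used = [], 0
    # Sort by relevance so the densest evidence is packed first.
    for text, score, n_tokens in sorted(chunks, key=lambda c: -c[1]):
        if score < min_score:
            continue  # drop noisy, low-signal chunks outright
        if used + n_tokens > budget_tokens:
            continue  # skip chunks that would overflow the window
        kept.append(text)
        used += n_tokens
    return "\n\n".join(kept), used / budget_tokens


# Hypothetical scored retrieval results (text, score, token count).
chunks = [
    ("LLMs are sensitive to irrelevant context.", 0.9, 8),
    ("Unrelated trivia about the weather.", 0.2, 6),
    ("Noise can trigger hallucinations.", 0.8, 6),
]
context, density = pack_context(chunks, budget_tokens=16)
```

The point of the sketch is the framing: the objective is not just recall, but evidence density per context token, so low-signal chunks are filtered even when budget remains.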

Abstract

Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise; misleading or irrelevant information is no longer just a nuisance, but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising (maximizing usable evidence density and verifiability within a context window) is becoming the primary bottleneck across the full information access pipeline. We conceptualize this paradigm shift through a four-stage framework of IR challenges: from inaccessible to undiscoverable, to misaligned, and finally to unverifiable. Furthermore, we provide a pipeline-organized taxonomy of signal-to-noise optimization techniques spanning indexing, retrieval, context engineering, verification, and agentic workflows. We also survey work on information denoising in domains that rely heavily on retrieval, such as lifelong assistants, coding agents, deep research, and multimodal understanding.
