Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation

arXiv cs.CL / 4/23/2026

📰 NewsIdeas & Deep AnalysisTools & Practical UsageModels & Research

Key Points

  • The paper addresses the “Lost-in-the-Middle” effect in LLMs, which reduces attention to information located in the middle of long context windows and can limit knowledge-retrieval over large structured libraries.
  • It proposes Self-Describing Structured Retrieval (SDSR), a lightweight approach where structured data files include human-authored navigation metadata placed at the primacy position to leverage LLM attention biases.
  • SDSR uses a Dual-Layer Guidance strategy that combines in-file metadata with explicit routing rules in the system prompt to improve precision retrieval.
  • In a four-round benchmark over an expanded 190-skill library (36→119 categories) with adversarial distractor injection, the combined approach (in-file + prompt guidance) achieves 100% primary routing accuracy versus 65% without guidance.
  • The work also extends SDSR to semi-structured corpora, suggesting cross-reference encoding can support retrieval without vector databases when document structure is recoverable.

Abstract

Large Language Models (LLMs) exhibit a well-documented positional bias when processing long input contexts: information in the middle of a context window receives substantially less attention than content at the boundaries, a phenomenon termed the Lost-in-the-Middle effect (Liu et al., 2024). This limits knowledge-retrieval applications that embed large structured knowledge bases directly in the LLM context. Retrieval-Augmented Generation (RAG) addresses scalability by retrieving only relevant fragments, but introduces substantial infrastructure overhead and is ill-suited to libraries whose semantic boundaries are human-defined rather than statistically learned. We propose Self-Describing Structured Retrieval (SDSR), a lightweight framework in which structured data files embed human-authored navigational metadata at the file's primacy position, thereby exploiting rather than fighting the LLM's primacy bias. We further propose a Dual-Layer Guidance strategy combining in-file metadata with explicit routing rules in the system prompt. We validate SDSR through a four-round benchmark using a 190-skill library expanded from 36 to 119 categories via adversarial distractor injection. Four conditions are tested: (A) no guidance, (B) in-file summary only, (C) prompt hint only, (D) both combined. Version D achieves 100% primary routing accuracy (20/20) at 119 categories versus 65% for the no-guidance baseline. We identify a fundamental asymmetry: primary routing is solvable by explicit rules, while secondary cross-category routing requires architectural intent explicitly encoded in the data structure. We further extend SDSR to semi-structured corpora, showing how cross-reference encoding enables operation without vector databases in domains with recoverable document structure.