How LLMs decide which pages to cite — and how to optimize for it

Reddit r/artificial / 4/20/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The article explains how LLM question answering often uses RAG, retrieving candidate pages from a crawled index and then scoring them before citing or using them.
  • It lists public research signals for ranking/citation decisions, including answer directness, presence of cited statistics, structured data via JSON-LD, crawl accessibility, and content freshness.
  • It highlights a major finding from the referenced Princeton GEO paper: adding schema markup can significantly improve precise information extraction (reported as rising from 16% to 54%).
  • It argues that this improvement can be the difference between a page being cited by LLM systems versus remaining effectively invisible.
  • It ends by inviting readers to share their own experiments and what tactics are working for them.

When ChatGPT or Perplexity answers a question, it runs RAG: it retrieves top candidate pages from a crawled index, then scores them before citing or quoting them. The scoring criteria are documented in the Princeton GEO paper (arxiv.org/abs/2311.09735).

Key signals: answer directness, cited statistics, structured data (JSON-LD), crawl access, and content freshness.

What surprised me most in the research: schema markup alone shifts precise information extraction from 16% to 54%. That's not a marginal gain — that's the difference between being cited and being invisible.
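For readers who haven't added schema markup before, here's a minimal JSON-LD block of the kind the post is talking about, generated with Python's standard `json` module. The field values are placeholders I made up for illustration; they are not taken from the GEO paper.

```python
# Minimal schema.org Article markup as JSON-LD.
# All values below are placeholder examples.
import json

article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How LLMs decide which pages to cite",
    "datePublished": "2026-04-20",
    "author": {"@type": "Person", "name": "Example Author"},
}

# The serialized object goes in the page head inside
# <script type="application/ld+json"> ... </script>
snippet = json.dumps(article_schema, indent=2)
print(snippet)
```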

Anyone else experimenting with this? Curious what's working for people here.

submitted by /u/esteban-vera