How LLMs decide which pages to cite — and how to optimize for it

Reddit r/artificial / 4/20/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The article explains how LLM question answering often uses RAG, retrieving candidate pages from a crawled index and then scoring them before citing or using them.
  • It lists public research signals for ranking/citation decisions, including answer directness, presence of cited statistics, structured data via JSON-LD, crawl accessibility, and content freshness.
  • It highlights a major finding from the referenced Princeton GEO paper: adding schema markup can significantly improve precise information extraction (reported as rising from 16% to 54%).
  • It argues that this improvement can be the difference between a page being cited by LLM systems versus remaining effectively invisible.
  • It ends by inviting readers to share their own experiments and what tactics are working for them.

When ChatGPT or Perplexity answers a question, it runs RAG: it retrieves top candidate pages from a crawled index, then scores them before citing or quoting them. The scoring criteria are documented in the Princeton GEO paper (arxiv.org/abs/2311.09735).

Key signals: answer directness, cited statistics, structured data (JSON-LD), crawl access, and content freshness.

What surprised me most in the research: schema markup alone shifts precise information extraction from 16% to 54%. That's not a marginal gain — that's the difference between being cited and being invisible.
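For readers who haven't added schema markup before, here's a minimal JSON-LD block of the kind the post is talking about, generated with Python's standard `json` module. The field values are placeholders I made up for illustration; they are not taken from the GEO paper.

```python
# Minimal schema.org Article markup as JSON-LD.
# All values below are placeholder examples.
import json

article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How LLMs decide which pages to cite",
    "datePublished": "2026-04-20",
    "author": {"@type": "Person", "name": "Example Author"},
}

# The serialized object goes in the page head inside
# <script type="application/ld+json"> ... </script>
snippet = json.dumps(article_schema, indent=2)
print(snippet)
```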

Anyone else experimenting with this? Curious what's working for people here.

submitted by /u/esteban-vera