It was about 90 minutes into the session. We’d just finished building a RAG pipeline from scratch in Python, the kind where you stare at FAISS indices and embeddings and wonder if you’ll ever actually deploy this in prod.
The instructor stopped scrolling through code and looked up.
“Alright,” he said. “Let’s stop pretending we’re building search. Let’s trace one live query through Perplexity. See what actually happens in the 3 seconds between you hitting enter and reading the answer.”
The room got quiet. Someone typed the question.
First, the Simple RAG Pattern (So We Have a Baseline)
If you’ve built any RAG system, you know the dance:
- Ingest — chunk documents, embed them, store in a vector DB
- Retrieve — embed the query, find the closest chunks
- Augment — inject those chunks into a prompt
- Generate — LLM answers, grounded in your docs
That’s the 40-line Python version. Clean. Predictable. Works great when your docs are a handful of policy PDFs.
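To keep the baseline concrete, here’s a toy version of that dance in pure Python. The bag-of-words "embedding" is a stand-in for a real embedding model (OpenAI, sentence-transformers, etc.), and the generate step is left as a prompt string rather than an actual LLM call — this is the shape of simple RAG, not a production implementation:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' -- a stand-in for a real
    embedding model; only the pipeline shape matters here."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingest: chunk documents, embed them, "store" them
docs = [
    "Refunds are processed within 14 days of purchase.",
    "Employees accrue 20 vacation days per year.",
    "All deployments must pass a staging smoke test.",
]
index = [(d, embed(d)) for d in docs]

# Retrieve: embed the query, find the closest chunk
query = "How long do refunds take?"
qv = embed(query)
top = max(index, key=lambda pair: cosine(qv, pair[1]))[0]

# Augment: inject the chunk into a prompt (Generate = send to an LLM)
prompt = f"Answer using only this context:\n{top}\n\nQuestion: {query}"
print(top)
```

Swap in a real embedding model and a vector DB and you have the tutorial version everyone ships first.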
But Perplexity isn’t doing that with 10 PDFs. It’s doing it with the entire internet, in real time, with a re-ranking step you won’t find in any beginner tutorial.
Here’s what we actually saw.
The Live Trace: One Query, Five Layers, Three Seconds
The query we used:
“What are the best practices for deploying LLMs in production?”
Not some toy question. Real. Ugly. Vague enough that a naive search would return 50 different blog posts with contradictory advice.
Layer 1 — Data Ingestion
The first thing that happens isn’t embedding. It’s crawling. Perplexity’s web crawler runs constantly, but for this query, it also fired off real-time searches to Bing and other indexes. Within milliseconds, it had 10–20 candidate web pages.
We could see the sources flash on screen: papers from Anthropic, blog posts from engineering teams at Uber and DoorDash, a surprisingly useful Hacker News thread from last week.
That last one matters: Perplexity isn’t using static training data. It’s pulling from this week.
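The fan-out-to-multiple-indexes step can be sketched like this. The search function is a stub (a real system would call something like the Bing Web Search API; the URLs and index names here are made up so the sketch runs offline), but the parallel fan-out plus dedupe is the pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def search_index(index_name, query):
    """Stub for a real-time search API call. Returns candidate URLs;
    hardcoded here so the sketch runs without network access."""
    return [f"https://{index_name}.example/result/{i}" for i in range(10)]

def gather_candidates(query, indexes=("bing", "internal-crawl")):
    # Fan out to several indexes in parallel...
    with ThreadPoolExecutor() as pool:
        batches = pool.map(lambda ix: search_index(ix, query), indexes)
    # ...then dedupe, preserving order, and cap at ~20 candidate pages
    seen, candidates = set(), []
    for batch in batches:
        for url in batch:
            if url not in seen:
                seen.add(url)
                candidates.append(url)
    return candidates[:20]

candidates = gather_candidates("deploying LLMs in production")
```

The point of the parallelism: the crawl/search layer is I/O-bound, so firing every index at once is how you get 10–20 candidates back in milliseconds instead of seconds.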
Layer 2 — Embeddings
Now it gets messy. Each of those 10–20 pages gets chunked into paragraphs (300–500 words each). Each paragraph gets embedded. The query gets embedded.
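The chunking step is simple but worth seeing, because chunk boundaries matter downstream. A minimal paragraph-grouping chunker, using the 300–500-word range described above (the exact splitting logic here is an assumption, not Perplexity’s):

```python
def chunk_page(text, max_words=400):
    """Group a page's paragraphs into chunks of at most max_words
    words each, never splitting inside a paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for p in paragraphs:
        words = len(p.split())
        # Flush the current chunk if adding this paragraph overflows it
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk then gets embedded exactly like a document in the baseline version, alongside the query itself.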
This is the part everyone understands. But here’s where it diverges from the simple RAG tutorial:
Layer 3 — Vector Search + Re-ranking
FAISS (or whatever vector index they’re using) finds the top 20–50 chunks based on cosine similarity. That’s the cheap, fast pass.
This is the part that surprised everyone in the cohort: they run a second retrieval pass with a dedicated re-ranking model. Not the same embedding model. A separate model trained specifically to score relevance more precisely.
Why? Because vector search alone will surface things that are semantically close but practically useless. Re-ranking fixes that.
We watched as the top results shifted. A paragraph about deployment strategies from a Google engineering blog jumped from position 12 to position 3. A random forum post dropped out entirely.
Result: 5–8 highly relevant paragraphs. Not 50.
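The two-pass shape looks like this. In production, the re-ranker would be a cross-encoder model that scores the query and chunk together (e.g. a BERT-style re-ranker); here it’s a toy term-overlap score, purely to show how the cheap pass and the expensive pass compose:

```python
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank_score(query, chunk):
    """Stand-in for a cross-encoder re-ranker. Real re-rankers read
    query and chunk together; this toy just measures term overlap."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def retrieve(query, chunks, fast_k=50, final_k=5):
    qv = embed(query)
    # Pass 1: cheap, fast vector search over everything
    fast = sorted(chunks, key=lambda c: cosine(qv, embed(c)),
                  reverse=True)[:fast_k]
    # Pass 2: expensive re-ranker, but only over the survivors
    return sorted(fast, key=lambda c: rerank_score(query, c),
                  reverse=True)[:final_k]

chunks = [
    "deploy llms to production with canary releases",
    "monitor llms in production with tracing",
    "my cat sleeps all day",
]
top = retrieve("how to deploy llms in production", chunks,
               fast_k=3, final_k=2)
```

The economics are the point: you can’t afford to run the expensive model over 10,000 chunks, but you can over 50.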
Layer 4 — Orchestration
This is where the prompt engineering happens, but not the kind you type yourself. Perplexity constructs a prompt that includes:
- System instructions for citation formatting
- The retrieved paragraphs (each tagged with its source URL)
- Your original query
The citations aren’t an afterthought. They’re part of the context, tracked through the entire orchestration layer so the final answer can link back to sources.
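A sketch of what that constructed prompt might look like. The field names, URLs, and instruction wording are illustrative assumptions — the real orchestration layer is proprietary — but the structure (system instructions, numbered source-tagged chunks, then the query) matches the three ingredients listed above:

```python
def build_prompt(query, retrieved):
    """Build a citation-aware prompt. Each retrieved chunk is assumed
    to carry its source URL as metadata; field names are illustrative."""
    system = (
        "Answer the question using only the numbered sources below. "
        "Cite every claim inline as [n], matching the source numbers."
    )
    sources = "\n".join(
        f"[{i}] ({c['url']}) {c['text']}"
        for i, c in enumerate(retrieved, start=1)
    )
    return f"{system}\n\nSources:\n{sources}\n\nQuestion: {query}"

retrieved = [
    {"url": "https://example.com/llm-deploy", "text": "Use canary releases."},
    {"url": "https://example.com/evals", "text": "Run offline evals first."},
]
prompt = build_prompt("How do I deploy LLMs safely?", retrieved)
```

Because each chunk enters the prompt already numbered and tied to its URL, the model’s `[n]` citations can be mapped back to sources after generation — that’s the tracking the orchestration layer maintains end to end.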
Layer 5 — LLM Generation
Finally, the model generates. In our trace, it was GPT-4o (though Perplexity switches models depending on the query type). The output came back with inline citations, a summary of key practices, and links to the original sources.
Total elapsed time: about 3 seconds. It felt instant.
What Surprised Us
Two things broke our assumptions.
First: the re-ranking step.
Everyone in the room thought Perplexity was just doing vector search + an LLM. None of us expected a dedicated re-ranking model. Once we saw it, it made sense: vector search alone is too noisy for production-grade answers. But it’s not something you see in the “build your own RAG” tutorials.
Second: the citation tracking.
This isn’t just “the model cites sources.” The orchestration layer is carefully constructed so the LLM has to include citations. The retrieved chunks come with metadata that the prompt instructs the model to reference. It’s not a nice-to-have. It’s a constraint baked into the architecture.
One person in the cohort said it best: “Oh, so this is RAG where the data source is the entire internet, and the retrieval is actually good.”
The Practical Lesson for Developers in 2026
Here’s what I took away, and it’s not “go build your own Perplexity.”
Understanding these pipelines (the actual layers, not just the high-level “RAG is search + generation”) is now table stakes. Even if you never deploy a vector database or train a re-ranking model, you need to know:
- Why your naive RAG prototype fails (because retrieval is noisy and you have no re-ranking)
- Why Perplexity’s answers feel better than ChatGPT with web search (because the orchestration layer is doing 10x more work than you think)
- Why citation tracking matters for trust, not just compliance
The product’s value isn’t in the model. It’s in the retrieval, the re-ranking, and the prompt construction. Perplexity uses the same models you have access to. What makes it work is everything before the LLM runs.
That’s the part you can’t see from the chat interface. And that’s the part worth understanding.
P.S. — If you’re curious, here’s the 5-layer breakdown we ended the session with:
| Layer | What Actually Happens |
|---|---|
| 1. Data Ingestion | Real-time web crawl + indexes, 10–20 candidate pages |
| 2. Embeddings | Chunk pages → embed chunks + query |
| 3. Vector Search + Re-ranking | FAISS-style fast retrieval → dedicated model re-ranks to top 5–8 |
| 4. Orchestration | Construct prompt with citations, source tracking, system instructions |
| 5. LLM Generation | Model writes grounded answer with inline links |
That’s what happens in the 3 seconds. Now you know.