Two Root Causes Behind LLM "Hallucinations"
Anyone who has worked with large language models has run into these two situations:
Situation 1: Knowledge Cutoff
You: What were our company's Q1 sales figures?
GPT: I'm sorry, my training data only goes up to early 2024 and I have
no access to your company's internal data.
Situation 2: Hallucination
You: How do I use LangChain's RunnablePassthrough?
GPT: RunnablePassthrough can be enabled by calling .with_config(pass_through=True)...
(This parameter doesn't exist.)
Both problems share the same root cause: an LLM's knowledge is frozen at training time.
The moment training completes, the model's "memory" is locked in place. It has no idea what happened today, knows nothing about your internal documents, and won't go look things up—it can only answer from memory. When memory runs dry, it either admits ignorance or invents a plausible-sounding answer.
That's where hallucinations come from: the model uses fluent language to fill in the gaps in its knowledge.
Three Solutions and How They Compare
There are three engineering approaches to this problem:
| Approach | Mechanism | Best For | Cost |
|---|---|---|---|
| Fine-tuning | Retrain on new data, "bake" knowledge into parameters | Fixed-domain language style, output format | Expensive, slow to update, limited factual recall |
| Long Context | Stuff all documents into the prompt | Small document sets, one-off queries | Token cost scales with document size on every query; quality degrades at extreme lengths |
| RAG | Dynamically retrieve relevant content at query time, inject into prompt | Large knowledge bases, continuously updated data | Requires retrieval infrastructure |
A common misconception is that fine-tuning is a good way to inject new facts. It isn't.
Fine-tuning changes a model's behavioral patterns and language style—it doesn't "store a book" inside the parameters. Experiments consistently show that fine-tuning on specific Q&A pairs produces limited accuracy gains on related questions, and if training data contains errors, the model confidently repeats those errors.
RAG's core advantage is separating "what to know" from "how to say it":
- Knowledge lives in an external database and can be updated anytime
- The model focuses purely on understanding and generation, not memorization
When should you use long context instead of RAG?
When the total document volume is under ~100K tokens, the query is one-off (not recurring), and API costs are acceptable, long context is often simpler. Claude and Gemini's extended context windows make "stuffing a whole book in" genuinely viable. But for enterprise knowledge bases—thousands of documents, continuous updates, multiple concurrent users—RAG remains the more sensible architecture.
What RAG Is: An Open-Book Exam Analogy
RAG = Retrieval-Augmented Generation.
The most intuitive way to think about it: turning a closed-book exam into an open-book exam.
Closed-book (pure LLM): The student answers purely from memory. Anything not memorized gets guessed.
Open-book (RAG): The student can consult reference materials, but still needs to understand the question, find the relevant content, and compose the answer. The reference materials are the external knowledge base; looking things up is the retrieval step.
This analogy reveals two key properties of RAG:
- Knowledge lives outside the model — it can be swapped and updated independently
- The model handles understanding and generation — after retrieval, the model still needs to "read" the content and produce a coherent response
The Complete RAG Pipeline
RAG operates in two distinct phases: the indexing phase (one-time, offline) and the query phase (real-time, per request).
Figure: The two-phase RAG architecture — top: offline indexing pipeline; bottom: real-time query pipeline; both share the same Vector DB
Indexing Phase
This phase completes before any user query arrives. It's a one-time preprocessing step.
Raw Documents → Document Loading → Text Splitting → Embedding → Vector Database
Step 1: Document Loading
Convert raw content from various formats into plain text. PDFs, Word docs, Markdown, web pages, code—each format has its own parsing challenges (tables and images in PDFs are notoriously tricky).
Step 2: Text Splitting (Chunking)
Cut long documents into smaller chunks. This step has a significant impact on final quality—chunks too large reduce retrieval precision; chunks too small lose semantic coherence. (Chunking strategies are covered in depth in a later article; for now, just understand why we split.)
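As a preview of why splitting matters, here is a minimal fixed-size splitter with overlap — a sketch for illustration only (the function name and default sizes are made up, not from any framework):

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlapping the boundaries means a sentence that straddles a cut
    still appears whole in at least one chunk, preserving some coherence.
    """
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance less than chunk_size to overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

doc = "RAG pipelines split long documents into chunks before embedding. " * 20
chunks = split_text(doc, chunk_size=200, overlap=50)
```

With these settings, each chunk shares its last 50 characters with the start of the next — the simplest of the chunking strategies the later article compares.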
Step 3: Embedding
Use an embedding model to convert each text chunk into a high-dimensional vector. This vector captures the semantic meaning of the text—semantically similar texts produce vectors that are close together in the vector space.
Step 4: Store in Vector Database
Store all vectors along with their original text in a vector database that supports similarity search (Chroma, Qdrant, Weaviate, etc.).
Query Phase
Executed in real time for every user request.
User Question → Embedding → Similarity Search → Retrieved Chunks → Prompt Assembly → LLM → Answer
Step 1: Query Embedding
Convert the user's question into a vector using the same embedding model.
Step 2: Similarity Search
Find the Top-K most similar text chunks in the vector database. Similarity is measured by closeness in vector space (cosine similarity, dot product, Euclidean distance, etc.).
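In NumPy terms, a Top-K search over a small index is just a similarity computation plus a sort. A sketch with toy 4-dimensional vectors standing in for real embeddings:

```python
import numpy as np

# Toy "embeddings" — stand-ins for real embedding-model output.
index_vectors = np.array([
    [0.90, 0.10, 0.00, 0.10],  # doc 0
    [0.10, 0.80, 0.20, 0.00],  # doc 1
    [0.85, 0.20, 0.10, 0.00],  # doc 2
])
query = np.array([1.0, 0.1, 0.0, 0.0])

def normalize(m):
    """L2-normalize along the last axis."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Cosine similarity = dot product of L2-normalized vectors.
sims = normalize(index_vectors) @ normalize(query)

# Indices of the 2 most similar documents, best first.
top_k = np.argsort(sims)[::-1][:2]
```

Real vector databases replace the `argsort` with approximate nearest-neighbor indexes (HNSW and similar) so the search stays fast at millions of vectors.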
Step 3: Prompt Assembly
Combine the retrieved text chunks with the user's question into a complete prompt and send it to the LLM. A typical format:
You are a knowledge assistant. Answer the user's question based solely on
the reference content provided below.
Reference content:
[Retrieved chunk 1]
[Retrieved chunk 2]
...
User question: [original question]
Please base your answer on the reference content. If the reference content
does not contain relevant information, say so clearly.
Step 4: LLM Generation
The LLM generates an answer grounded in the provided context, rather than from its internal parameters alone.
Hands-On: A Minimal RAG in 100 Lines
No frameworks—just the OpenAI API. Let's implement a working RAG from scratch. The goal is to see exactly what each step does, without the abstraction layers of a framework hiding the details.
"""
Minimal RAG implementation — no frameworks, OpenAI API only.
Demonstrates the complete RAG pipeline: indexing + querying.
"""
import numpy as np
from openai import OpenAI
client = OpenAI() # requires OPENAI_API_KEY environment variable
# ─────────────────────────────────────────
# Simulated knowledge base: 5 technical docs
# ─────────────────────────────────────────
DOCUMENTS = [
"LangChain is a framework for building LLM applications, providing chaining, memory management, and tool integration.",
"Vector databases enable semantic search by converting text into high-dimensional vectors. Common options include Chroma, Qdrant, Weaviate, and Pinecone.",
"RAG (Retrieval-Augmented Generation) reduces LLM hallucinations by retrieving relevant documents before generation, improving answer accuracy.",
"Embedding models convert text into fixed-dimension vectors, where semantically similar texts are positioned closer together in vector space.",
"Fine-tuning retrains a model on specific data to adjust its behavior — it's best suited for changing output style, not injecting new factual knowledge.",
]
# ─────────────────────────────────────────
# Indexing phase: convert documents to vectors
# ─────────────────────────────────────────
def build_index(documents: list[str]) -> list[dict]:
"""
Convert each document into a vector.
Returns [{text, embedding}, ...]
In production, store these in a vector database.
"""
print(f"Indexing {len(documents)} documents...")
index = []
for doc in documents:
response = client.embeddings.create(
model="text-embedding-3-small",
input=doc
)
embedding = response.data[0].embedding
index.append({"text": doc, "embedding": embedding})
print("Index built successfully.")
return index
def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
"""Cosine similarity between two vectors. Range: -1 to 1, higher = more similar."""
a = np.array(vec_a)
b = np.array(vec_b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# ─────────────────────────────────────────
# Query phase: retrieve + generate
# ─────────────────────────────────────────
def retrieve(query: str, index: list[dict], top_k: int = 2) -> list[str]:
"""
Embed the query and find the top_k most similar documents.
"""
response = client.embeddings.create(
model="text-embedding-3-small",
input=query
)
query_embedding = response.data[0].embedding
scored = []
for doc in index:
score = cosine_similarity(query_embedding, doc["embedding"])
scored.append((score, doc["text"]))
scored.sort(key=lambda x: x[0], reverse=True)
return [text for _, text in scored[:top_k]]
def generate(query: str, context_docs: list[str]) -> str:
"""
Assemble retrieved docs + user question into a prompt and call the LLM.
"""
context = "
".join([f"- {doc}" for doc in context_docs])
prompt = f"""You are a knowledge assistant. Answer the user's question based
solely on the reference content below. If the reference content does not contain
relevant information, say so clearly — do not make anything up.
Reference content:
{context}
User question: {query}
Answer:"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
def rag_query(query: str, index: list[dict]) -> str:
"""Full RAG query pipeline."""
print(f"
Question: {query}")
docs = retrieve(query, index, top_k=2)
print(f"Retrieved {len(docs)} relevant documents:")
for i, doc in enumerate(docs, 1):
print(f" [{i}] {doc[:60]}...")
answer = generate(query, docs)
print(f"
Answer: {answer}")
return answer
# ─────────────────────────────────────────
# Run the demo
# ─────────────────────────────────────────
if __name__ == "__main__":
# Build the index (only needs to be done once in a real system)
index = build_index(DOCUMENTS)
# Test with a few questions
rag_query("What is a vector database?", index)
rag_query("What's the difference between RAG and fine-tuning?", index)
rag_query("What is Python's GIL?", index) # Not in the knowledge base — testing refusal
Sample output:

Indexing 5 documents...
Index built successfully.

Question: What is a vector database?
Retrieved 2 relevant documents:
  [1] Vector databases enable semantic search by converting text into high-dim...
  [2] Embedding models convert text into fixed-dimension vectors, where semant...

Answer: A vector database is a database system that enables semantic search by
converting text into high-dimensional vectors. Common examples include Chroma,
Qdrant, Weaviate, and Pinecone...

Question: What is Python's GIL?
Retrieved 2 relevant documents:
  [1] LangChain is a framework for building LLM applications...
  [2] Fine-tuning retrains a model on specific data...

Answer: Based on the provided reference content, I cannot answer your question
about Python's GIL — the reference material does not contain relevant information.
Notice the last question: the knowledge base has nothing about Python's GIL, and the LLM explicitly says it can't answer rather than inventing a response. This is how RAG controls hallucinations: a constraint in the prompt instructs the model to answer only from retrieved content.
Limitations of This Implementation
The 100 lines above demonstrate the complete RAG pipeline, but there are obvious shortcomings:
| Problem | Cause | Engineering Solution |
|---|---|---|
| Vectors live in memory, lost on restart | No persistence | Vector database (Chroma / Qdrant) |
| Long documents passed in directly will exceed token limits | No chunking | Text Splitter strategies (next article) |
| Poor keyword matching; pure vector retrieval only | No hybrid search | Hybrid search (later in series) |
| No quality measurement | No evaluation | RAGAS evaluation framework (later in series) |
Each limitation maps directly to a topic covered in the upcoming articles.
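On the first limitation: even before adopting a vector database, the demo's in-memory index can be persisted to disk as a stopgap. A sketch, assuming the `[{text, embedding}, ...]` shape that `build_index` returns (the file path here is illustrative):

```python
import json
import os
import tempfile

# Same shape as the demo's index: a list of {"text": ..., "embedding": [...]}.
index = [
    {"text": "Vector databases enable semantic search.", "embedding": [0.1, 0.9, 0.3]},
    {"text": "Embedding models map text to vectors.", "embedding": [0.2, 0.8, 0.4]},
]

path = os.path.join(tempfile.gettempdir(), "rag_index.json")

# Save once after indexing...
with open(path, "w", encoding="utf-8") as f:
    json.dump(index, f)

# ...and reload on restart, skipping the embedding API calls entirely.
with open(path, "r", encoding="utf-8") as f:
    restored = json.load(f)
```

This only buys persistence — a real vector database additionally provides approximate nearest-neighbor search, filtering, and concurrent access, which is why the table points to Chroma or Qdrant.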
Summary
This article addressed three core questions:
- Why RAG? — LLM knowledge cutoff and hallucinations both stem from knowledge being frozen in model parameters
- What is RAG? — Dynamically retrieve external knowledge at query time, inject it into the prompt, and let the LLM answer based on evidence
- RAG vs. alternatives — Fine-tuning changes behavior; long context works for small documents; RAG is built for large-scale, continuously-updated knowledge bases
Next up: the first deep dive into RAG's core components — text chunking strategies. Why does the chunking approach have such a dramatic impact on quality, and how do you choose between the four main strategies?
References
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Original RAG paper (Lewis et al., 2020)
- OpenAI Embeddings Documentation
- LangChain RAG Tutorial